1.  INTRODUCTION

In this final report we present our work performed during
the period June 8, 1979 to July 31, 1980 in the area of speech
compression and synthesis. The reader is referred to the
previous annual report [1] for work performed between April 6,
1978 and June 7, 1979 under this contract.

In Section 1.1 we give a very brief list of the major
accomplishments in the past year. The reader is referred to the
body of the report for details on these as well as other
accomplishments. An outline of this report is given in Section
1.2. In Section 1.3, we give a list of the presentations and
publications for the past year. The publications are included in the
Appendix.

1.1  Major  Accomplishments 

a)  Completed the labeling of the diphone data base for
phonetic synthesis. The total number of diphones is
now 2733.

Generalized the phonetic synthesis program to allow the
use of phonological rules combined with diphone
templates.

Improved the algorithms used by the phonetic synthesis
program for gain normalization and time warping.

b)  Wrote a program to interface the MITALK text-to-speech
program to our diphone synthesis program.

c)  Developed the Harmonic Deviations Vocoder for LPC
analysis/synthesis. This new method models the speech
spectrum more accurately by compensating for the
spectral errors in the LPC spectrum at the pitch
harmonics.

d)  Designed  and  implemented  a  phonetic  recognition  program 
that  compares  input  speech  to  a  network  of  diphone 
templates.  This  program  produces  a  sequence  of 
phonemes,  durations  and  pitch  values  suitable  for  input 
to  the  phonetic  synthesis  program.  An  initial  network 
has  been  constructed  from  the  diphone  template  data 
base.  The  program  also  allows  the  user  to  augment  the 
diphone  network  interactively. 

e)  Implemented  extended  addressing  in  the  BCPL  compiler 
and  our  user  programs  under  the  KL-10  TOPS20-Release  4 
monitor.  This  allows  data  structures  of  up  to  8 
million  words  to  be  addressable  by  a  single  user 
process.  This  improvement  enables  the  entire  diphone 
network  to  be  "in  core"  at  all  times. 

f)  Implemented  and  tested  an  embedded-code  multirate 
adaptive  transform  coding  system  capable  of  operating 
at  an  arbitrary  data  rate  in  the  range  2.5  to  9.6  kb/s. 


1.2  Outline 

In  Section  2  we  describe  this  year's  work  on  the  phonetic 
synthesis part of the very-low-rate (VLR) phonetic vocoder. Our
initial  design  and  implementation  of  the  phonetic  recognition 
part  of  the  phonetic  vocoder  is  discussed  in  Section  3.  Section 
4  contains  a  description  of  our  embedded-code  multirate  adaptive 
transform  coding  system. 

1.3  Presentations  and  Publications


During  the  past  year,  we  gave  a  number  of  oral  presentations 
at  the  regular  ARPA  Network  Speech  Compression  (NSC)  Meetings. 
In  addition,  we  made  five  presentations  at  conferences  and  had 
one  paper  published.  These  were: 

a)  M.  Berouti  and  J.  Makhoul,  "An  Adaptive-Transform 
Baseband  Coder,"  Proceedings  of  the  97th  Meeting  of  the 
Acoustical  Society  of  America,  paper  MM10,  pp.  377-380, 
June,  1979. 

b)  J.  Makhoul,  "A  Fast  Cosine  Transform  in  One  and  Two 
Dimensions," IEEE Trans. on Acoustics, Speech and
Signal  Processing,  Vol.  ASSP-28,  pp.  27-34,  Feb.  1980. 

c)  M.  Berouti  and  J.  Makhoul,  "An  Embedded-Code  Multirate 
Speech  Transform  Coder,"  IEEE  Int.  Conf.  on  Acoustics, 
Speech  and  Signal  Processing,  Denver,  CO,  pp.  356-359, 
April  1980. 

d)  R. Schwartz, J. Klovstad, J. Makhoul and J. Sorensen,
"A Preliminary Design of a Phonetic Vocoder Based on a
Diphone Model," IEEE Int. Conf. on Acoustics, Speech
and Signal Processing, Denver, CO, pp. 32-35, April
1980.

e)  J.  Makhoul,  R.  Schwartz,  J.  Klovstad  and  J.  Sorensen, 
"Phonetic  Recognition  and  Synthesis  in  a  Total 
Communication  System,"  Presented  at  the  Dallas 
Symposium  on  Voice-Interactive  Systems,  May  1980. 

Copies  of  papers  a)  -  d)  are  given  in  the  Appendix. 

2.  PHONETIC  SYNTHESIS

2.1  Introduction 

The phonetic synthesizer is one of the two essential
components in our proposed very-low-rate (VLR) speech
transmission system [2]. Operating at a data rate of about 100
bits per second, the VLR vocoder models speech in terms of
phoneme-sized units. A block diagram of this system is shown in
Figure 1.

This figure shows that input speech undergoes analysis which
results in a set of phonemes, phoneme durations and pitches. A
phoneme and its associated values of duration and pitch are called
a "triplet". Speech rates are typically about 12 phonemes per
second, and since each triplet can be encoded into 8 bits, the
data rate in the transmission channel is about 100 bits per
second. Once the triplets are decoded at the receiving end, a
phonetic synthesizer reconstructs the original speech. The
phonetic synthesizer must have a stored speech data base to
perform this resynthesis. Furthermore, the analysis program must
also have access to a data base of phonetically labeled speech.
We have designed our phonetic vocoder such that both the analysis
and synthesis programs can use essentially the same phonetic data
base.

FIG. 1.  Block Diagram of the VLR Vocoder

Once a suitable data base has been created, the phonetic
synthesis program can produce speech. The phonetic synthesis
program:

a)  translates the input phoneme sequence into a diphone
sequence;

b)  selects the most appropriate diphone template
(depending on the local phonetic context);

c)  time-warps each of the diphone templates to produce a
gain track and 14 LAR parameter tracks of the specified
durations;

d)  smooths  between  adjacent  warped  diphone  templates  to 
minimize  gain  and  spectral  discontinuities; 

e)  reconstructs  continuous  pitch  tracks  by  linear 
interpolation  of  the  single  pitch  values  given; 

f)  determines  the  cutoff  frequency  and  voicing  using 
knowledge  of  the  phoneme  being  synthesized; 

g)  converts  resulting  LAR  parameter  tracks  to  LPC 
parameter  tracks; 

h)  uses  the  resulting  sequence  of  LPC,  pitch,  gain,  and 
cutoff  frequency  (specified  every  10  ms)  as  input  to 
control  an  LPC  speech  synthesizer. 
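
Step (e) above can be illustrated with a minimal Python sketch. The report does not say where within a phoneme the single pitch value applies, so the sketch assumes, purely for illustration, that each value is anchored at the phoneme midpoint and that the track is held constant before the first and after the last anchor. The function name `pitch_track` and the example values are hypothetical.

```python
def pitch_track(durations_ms, pitches_hz, frame_ms=10):
    """Linearly interpolate one pitch value per phoneme into a frame-rate track.

    Assumptions (not from the report): each pitch value sits at the midpoint
    of its phoneme; the track is held flat outside the first/last midpoint.
    """
    if not durations_ms or len(durations_ms) != len(pitches_hz):
        raise ValueError("need one pitch value per phoneme")
    # Midpoint time of each phoneme, in ms.
    mids, start = [], 0.0
    for d in durations_ms:
        mids.append(start + d / 2.0)
        start += d
    total = start
    track, t, seg = [], 0.0, 0  # seg: index of the anchor at or before time t
    while t < total:
        while seg + 1 < len(mids) and t >= mids[seg + 1]:
            seg += 1
        if t <= mids[0]:
            f = pitches_hz[0]
        elif seg + 1 == len(mids):
            f = pitches_hz[-1]
        else:
            w = (t - mids[seg]) / (mids[seg + 1] - mids[seg])
            f = (1 - w) * pitches_hz[seg] + w * pitches_hz[seg + 1]
        track.append(f)
        t += frame_ms
    return track


# Two 100-ms phonemes at 100 Hz and 120 Hz give twenty 10-ms frames.
track = pitch_track([100, 100], [100.0, 120.0])
```

The 10-ms frame step matches the parameter update interval quoted in step (h); any other anchoring rule could be substituted by changing how `mids` is computed.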


2.2  Diphone  Data  Base 

In  this  section  we  briefly  review  what  we  mean  by  a 
"diphone",  and  then  describe  the  efforts  in  labeling  the  data 
base  and  in  developing  a  specification  for  diphone  context. 

2.2.1  The  Diphone  Concept 

The phonetic synthesizer is designed around the concept of a
"diphone". A diphone is defined as the region from the middle of
one phoneme to the middle of an adjacent phoneme. As noted
above, the synthesizer must be provided with a data base
containing any diphone which would be needed to synthesize a
given utterance. For example, synthesizing the word "fan"
requires a total of four diphones: [- F], [F AE], [AE N] and
[N -] ("-" represents "silence"). The synthesizer would locate
templates  for  these  diphones  in  its  data  base,  perform  the 
appropriate  concatenation  and  time  warping  of  stored  spectral  and 
durational  parameters,  and  then  synthesize  output  speech.  The 
consequence  of  this  method  is  that  a  fairly  large  number  of 
diphone  templates  must  be  available  to  the  synthesizer  as  a  data 
base.  In  our  vocoder  system,  these  templates  are  extracted  from 
nonsense  utterances  recorded  by  a  single  speaker  in  a  quiet  room. 
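
The translation from a phoneme string to a diphone string in the "fan" example can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the utterance is assumed to begin and end in silence, and context-dependent variants of a diphone are not handled.

```python
SILENCE = "-"  # the report writes silence as "-"


def phonemes_to_diphones(phonemes):
    """Return the diphone sequence for an utterance bounded by silence."""
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    # Each diphone spans the middle of one phoneme to the middle of the next.
    return list(zip(padded[:-1], padded[1:]))


print(phonemes_to_diphones(["F", "AE", "N"]))
# [('-', 'F'), ('F', 'AE'), ('AE', 'N'), ('N', '-')]
```

An utterance of N phonemes therefore needs N + 1 diphone templates.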

2.2.2  Labeling 

By labeling, we are referring to the task of marking
phonetic boundary locations in the nonsense utterances. For
example, consider the nonsense phrase "mee mee meem". This
phrase contains several instances of the diphones [M IY] and
[IY M], and one instance each of the diphones [- M] and [M -]. Using
our interactive software and display programs, a phonetician
places labels at any or all of the phoneme boundary locations in
this utterance. The speech information for diphones is then
appropriately extracted from other files and placed together with
all other diphones to form the data base.

A  large  effort  was  invested  this  year  in  hand-labeling  this 
diphone  data  base.  The  initial  labeling  was  completed  about  half 
way  through  the  year,  and  since  that  time  we  have  added  some  new 
diphones  as  well  as  eliminated  a  few  unnecessary  ones.  At 
present,  the  data  base  contains  2733  diphones. 

We  recently  completed  the  addition  of  about  50  diphones 
containing flapped "t" (as in "butter") in various vowel
contexts.  There  are  likely  to  be  a  few  more  minor  additions  to 
the  data  base  as  we  encounter  diphones  in  particular  phonetic 
contexts.  Appendix  A  gives  a  classification  of  the  data  base  and 
Appendix  B  contains  an  exhaustive  list  of  the  2733  diphones. 

2.2.3  Specification  of  Context 

The  data  base  and  the  phonetic  vocoder  system  are  designed 
so  that  it  is  possible  to  specify  explicitly  the  immediate 
phonetic  context  of  the  diphones.  Consider  the  two  phrases  "gray 
train"  and  "great  rain".  There  is  considerable  variation  in  the 
/t/ in these two cases; in particular, the aspiration following
the /t/ release is longer when it is found in the /tr/ consonant
cluster and its burst frequencies are much lower. As a result,
we must have two [T R] diphones in the data base: one where the
/tr/ is part of a consonant cluster and another where the /t/ and
/r/ are separated by a syllable boundary. In the case of the
intervening boundary, the diphone is designated as [T#R], and in
the consonant cluster as [T R].

The phonetic vocoder also makes use of implicit context. By
judicious  choice  of  the  inventory  of  phonemes,  we  can  in  some 
cases  eliminate  the  need  to  specify  explicit  context.  For 
example,  most  phonetic  transcriptions  of  the  word  "here"  contain 
the  phonemes  /h/,  /i/,  and  either  /r/  or  an  r-colored  schwa 
vowel.  In  our  vocoder,  the  transcription  is  [H]  followed  by  the 
vowel  [IR].  The  fact  that  the  /i/  occurs  in  the  context  of  a 
following  /r/  is  therefore  taken  into  account  simply  by  calling 
/i/  followed  by  /r/  a  separate  phoneme,  rather  than  by  labeling 
the  /i/  as  having  occurred  in  the  context  of  a  following  /r/. 
The phonetic inventory also contains four other r-colored vowels
(e.g., as in "there", "far", "poor", and "four", written
[EYR], [AR], [UR], and [OR], respectively).

2.2.4  Rule-Based  Synthesis 

There are occasions when it is advantageous to synthesize
particular phonemes by rule rather than by storing and recalling
parameters of real speech in the data base. One such case is the
glottal stop phoneme, as occurs between the two words "three eagles".
The major acoustic manifestation of the glottal stop is a sudden
drop followed by a gradual rise in both pitch and energy. We
have implemented such a rule in the synthesizer, which eliminates
the need to have diphones containing glottal stops in the data
base.


Another  rule  which  was  recently  implemented  was  to  lower  the 
energy  during  the  phoneme  "silence"  to  a  low  level.  We  found  low 
level  noise  existing  in  our  "synthetic  silence",  and  since  it  was 
supposed  to  be  silent,  the  obvious  solution  was  to  force  it  to  be 
so.  Both  of  these  rules  have  been  tested  and  are  well  received 
by  listeners. 

2.3  Diphone  Testing 


In order to test both the synthesizer and its data base of
diphones, we have carried out a process of testing and
evaluation. Two basic methods are used. The first is to
generate nonsense utterances from a set of rules contained in the
synthesis program. The second is to synthesize a complete
sentence by inputting a phonetic transcription of an actual
sentence, and then to compare the synthetic output to the original
utterance. These methods of testing are described below.

2.3.1  Automatic  Sequence  Generation 

As described in QPR #5 [3], one of the automatic sequence
generators in the synthesizer produces nonsense utterances of the
form:

CVC - VCV - CVCVCV,

where C and V are a given consonant and vowel. Once a vowel is
specified,  the  program  synthesizes  the  utterance  with  all 
possible  consonants  in  place  of  C.  These  utterances  are 
subsequently  played  back  for  evaluation  by  listeners.  In  this 
manner,  we  can  test  any  vowel-consonant  or  consonant-vowel 
diphone  contained  in  the  data  base. 
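
A sequence generator of this kind can be sketched as follows. This is an illustrative Python sketch, not the synthesizer's own code: the function names are hypothetical, the "-" in the pattern is assumed to stand for a pause (silence), and the consonant list in the usage line is a small made-up subset.

```python
PATTERN = "CVC-VCV-CVCVCV"  # the form quoted above, spaces removed


def cv_test_utterance(consonant, vowel, pause="-"):
    """Phoneme sequence for one nonsense utterance of the form CVC - VCV - CVCVCV."""
    table = {"C": consonant, "V": vowel, "-": pause}
    return [table[symbol] for symbol in PATTERN]


def cv_test_set(vowel, consonants):
    """One utterance per consonant for a fixed vowel."""
    return [cv_test_utterance(c, vowel) for c in consonants]


# Hypothetical subset of the consonant inventory.
utterances = cv_test_set("IY", ["M", "N", "S"])
```

Each utterance exercises both the [C V] and the [V C] diphone for its consonant, along with the corresponding silence-boundary diphones.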

Testing  consonant-consonant  diphones  is  accomplished  by 
generating  sequences  of  the  form: 

C1VC2  C1VC2

For a given consonant C1 and vowel V, all possible consonants C2
are  used  to  generate  a  group  of  utterances.  The  synthesizer  also 
allows  a  phoneme  sequence  to  be  typed  in  directly  from  a 
keyboard,  if  a  specific  phoneme  string  is  desired. 

These sequence generators and the option for keyboard input
allow us to test the diphones in an exhaustive fashion. We have
completed the testing of the consonant-consonant diphones, and
have begun testing the consonant-vowel diphones. This testing
has allowed us to isolate several program bugs in the diphone
concatenation and interpolation routines (see below).

2.3.2  Complete  Sentences 

The  synthesis  of  complete  sentences  involves  sending  the 
phonetic  transcription  of  a  particular  sentence  (in  the  form  of 
triplets,  each  comprising  a  phoneme,  a  duration  and  a  pitch 
value)  to  the  synthesizer,  and  listening  to  the  resultant  speech. 
About thirty sentences have been synthesized with this method of
input.

2.3.3  Debugging 

Since  exhaustive  testing  of  the  diphones  began  about  the 
middle  of  the  contract  year,  we  have  been  able  to  uncover  and 
correct  a  number  of  program  bugs  in  the  synthesizer.  After  much 
searching,  we  recently  found  the  cause  of  sudden  pops  at 
unvoiced-voiced  boundaries,  especially  in  the  case  of  strident 
fricatives  such  as  "s"  or  "sh",  where  there  is  a  large  amount  of 
energy  in  the  unvoiced  region.  Briefly,  one  of  the  programs  that 
extracts  the  waveform  parameters  that  are  compiled  into  the  data 
base  was  performing  incorrectly  at  the  boundary  between  unvoiced 
and voiced segments. In QPR #7 [7], we mentioned we had tried to
solve this problem by relabeling the utterances which contain
these diphones, but the newer solution is clearly the correct
one. A number of errors have also been corrected in the diphone
concatenation routines, which has resulted in more
acceptable-sounding synthetic speech output.

At  the  present  time  the  synthetic  speech  sounds  quite 
natural.  As  the  synthesizer  is  used  in  the  phonetic  vocoder  we 
will  occasionally  find  and  solve  minor  problems  with  the 
synthesis  to  further  improve  the  quality. 

2.3.4  Algorithm  Refinements

Gain  Normalization

As was mentioned in QPR No. 3 [4] and in the last Annual
Report [1], each group of recorded diphone utterances has
associated  with  it  two  normalization  utterances.  The  synthesizer 
interpolates  between  the  levels  in  these  two  normalization 
utterances  to  compute  a  normalization  level  for  each  diphone 
utterance.  The  synthesizer  has  been  changed  to  use  this 
normalization  level  to  set  the  level  of  each  diphone  template. 
That  is,  the  synthesizer  amplifies  those  diphones  with  a  lower 
normalization level under the assumption that the speaker was
talking at a lower overall level for those diphones.

We have mainly been testing the diphones by synthesizing
nonsense sentences. However, the complete sentences that we
synthesized showed none of the inappropriate-level problems
we had experienced before.
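
The gain normalization described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the report: the two normalization utterances are taken to bracket the group (relative positions 0 and 1), levels are expressed in dB, interpolation is linear in dB, and every template is brought to a common reference level. All names are hypothetical.

```python
def normalization_level(pos, level_first_db, level_last_db):
    """Interpolate the speaking level (dB) at relative position pos in [0, 1]."""
    if not 0.0 <= pos <= 1.0:
        raise ValueError("pos must lie in [0, 1]")
    return (1.0 - pos) * level_first_db + pos * level_last_db


def template_gain_db(pos, level_first_db, level_last_db, reference_db):
    """Gain (dB) that brings a diphone template to the reference level.

    A template recorded at a lower normalization level gets a positive
    gain, i.e. it is amplified.
    """
    return reference_db - normalization_level(pos, level_first_db, level_last_db)


# A diphone recorded midway between normalization utterances at 60 dB and 56 dB
# is amplified by 2 dB relative to a 60 dB reference.
gain = template_gain_db(0.5, 60.0, 56.0, 60.0)
```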

Time  Warping 

Two  changes  were  made  to  the  time  warping  algorithm  in  the 
synthesis program. As discussed in the various QPRs, the
time-warping  algorithm  treats  the  diphone  template  as  being  made 
up  of  elastic  and  relatively  inelastic  regions.  For  example,  the 
region  of  the  diphone  template  immediately  surrounding  the 
phoneme  boundary  is  not  changed  by  as  much  as  the  middle  region 
of  the  phoneme.  The  formula  that  had  been  used  previously 
resulted  in  unreasonable  time  warping  when  the  diphone  template 
was  many  times  longer  than  the  desired  phonemes.  Consequently, 
the  formula  for  the  time  warping  factor  during  the  inelastic 
region  has  been  changed. 

Figure 2 shows the formula for the stretch factor during the
inelastic  region  as  a  function  of  the  stretch  factor  during  the 
elastic  region.  The  stretch  factor  is  a  number  that  when 
multiplied  by  the  template  duration  gives  the  resulting  duration. 
So  if  the  stretch  factor  is  2  during  the  elastic  region,  that 
region is stretched to twice its original length.

[Figure: inelastic-region stretch factor plotted against the stretch factor in the elastic region, SF_el]

FIG. 2.  New Two-Region Formula for Non-Linear Time Warping

As can be seen in the figure, when the stretch factor in the elastic
region is greater than 1 (i.e. the template must be stretched to
match the required duration) the inelastic region stretch factor
is closer to 1. The formula in this region is

    SF_inel = (1 - b) SF_el + b

where SF_el and SF_inel are the stretch factors in the elastic and
inelastic regions, respectively.

If the stretch factor in the elastic region is less than 1, then
the formula is a quadratic:

    SF_inel = -b SF_el^2 + (1 + b) SF_el

This particular quadratic is used because its derivative is the
same as that in the previous formula at the boundary (SF_el = 1).


The  program  is  given  the  durations  of  the  elastic  and 
inelastic  regions  in  each  half  of  the  template  (each  half  of  the 
template  corresponds  to  half  of  one  phoneme)  and  the  required 
total  duration  for  that  half  of  the  phoneme  in  the  synthesized 
speech.  It  then  solves  for  the  stretch  factors  using  the  two 
formulae  given  above.  We  have  found  that  this  new  procedure  has 
resulted  in  the  elimination  of  some  of  the  unnatural  transitions 
that  were  present  previously. 
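
The solution step can be sketched in Python. This is a hedged illustration, not the program's actual code. It assumes the two-region formula has the form SF_inel = (1 - b) SF_el + b for SF_el >= 1 and SF_inel = -b SF_el^2 + (1 + b) SF_el for SF_el < 1, and that the half-phoneme duration is the sum of the two stretched regions. The value b = 0.5 and all function names are hypothetical.

```python
import math


def inelastic_stretch(sf_el, b):
    """Assumed two-region formula: linear above 1, quadratic below 1."""
    if sf_el >= 1.0:
        return (1.0 - b) * sf_el + b
    return -b * sf_el ** 2 + (1.0 + b) * sf_el


def solve_stretch(d_el, d_inel, target, b=0.5):
    """Find (SF_el, SF_inel) with SF_el*d_el + SF_inel*d_inel == target."""
    if not 0.0 <= b < 1.0:
        raise ValueError("b is assumed to lie in [0, 1)")
    if d_el < 0 or d_inel < 0 or d_el + d_inel <= 0 or target < 0:
        raise ValueError("durations must be non-negative, template non-empty")
    if target >= d_el + d_inel:
        # Stretching: the linear branch gives a linear equation in SF_el.
        sf_el = (target - b * d_inel) / (d_el + (1.0 - b) * d_inel)
    else:
        # Shortening: b*d_inel*x^2 - coef*x + target = 0; take the root in [0, 1).
        coef = d_el + (1.0 + b) * d_inel
        disc = coef ** 2 - 4.0 * b * d_inel * target
        sf_el = 2.0 * target / (coef + math.sqrt(disc))
    return sf_el, inelastic_stretch(sf_el, b)


# Template half: 10 frames elastic, 4 frames inelastic, shortened to 7 frames.
sf_el, sf_inel = solve_stretch(10.0, 4.0, 7.0)
```

Under these assumptions the inelastic stretch factor always lies between the elastic one and 1, and it goes to zero only when the elastic one does, which is consistent with the motivation given above for replacing the earlier formula.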


A  second  change  to  the  time  warping  formula  involved  the 
effect  of  speaking  rate  on  the  pronunciation  of  the  diphones.  We 
had  previously  made  the  assumption  that  the  effect  of  speaking 
rate  on  time-warping  of  each  phoneme  was  symmetric  about  the 
middle of the phoneme. We have observed, however, another
effect. In very slow speech, there tends to be a relatively fast
attack for each phoneme and a slower decay. Fig. 3 shows a
typical energy track for a vowel spoken at two different rates.




FIG. 3.  NON-SYMMETRIC TIME WARPING. a) Shows a typical energy
track for two halves of a long vowel phoneme
reconstructed from two diphone templates. The dotted
line indicates where the templates meet (corresponding
to the middles of the phonemes in the nonsense
utterances). b) Shows a typical energy track for a
shorter vowel phoneme. The middle of the vowel in (a)
is mapped (dotted line) to a point after the middle in
(b) to produce the desired change in the shape of the
contour.


As can be seen, the longer version decays more slowly (even after
accounting for the overall duration change) than does the shorter
version. Though not shown here, the spectral parameters also
exhibit the same non-symmetric time warping characteristics.

The  solution  in  the  time  warping  algorithm  has  been  to  map 
the  middle  of  a  long  phoneme  (the  ends  of  two  diphone  templates) 
to a point after the middle of the required phoneme. Thus, if
Fig. 3a shows the energy track as reconstructed by concatenating
two diphone templates, then when shortening this phoneme to the
desired phoneme duration shown in Fig. 3b, the shape of the
contour on the right side is changed by mapping the template onto
the phoneme asymmetrically. This mapping (indicated by the
dotted line connecting the two energy tracks in Fig. 3) should
result in more appropriate pronunciation over a wider range of
speaking rates. We have not yet fully tested this new feature.

Transmitted Gain Values

In  the  original  design  of  the  phonetic  vocoder,  intonation 
was  conveyed  by  the  duration  and  pitch  values.  We  have  found 
that,  occasionally,  these  are  not  sufficient;  often,  the  level  of 
energy  is  inappropriate.  The  problem  is  distinct  from  the  gain 
normalization  discussed  above,  which  is  intended  primarily  to 
adjust  for  the  speaking  level  at  the  time  of  recording. 

We  have  added  the  capability  to  transmit  with  each  phoneme 
(or alternatively, with each vowel) an adjustment to the gain in
the diphone templates. This adjustment is currently a one-bit
code specifying whether the input phoneme is closer to "normal"
loudness, or reduced in level somewhat (by 5 dB). The number of
levels and the magnitudes of the adjustments in the program are
easily modified. Using a one-bit gain adjustment for each vowel
would add only 4-5 bits/second to the transmission rate, but is
likely to improve the perception of intonation. We have not yet
had a chance to measure the improvement in quality due to this
addition.
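
The rate arithmetic behind these figures can be checked with a short Python sketch. The phoneme rate and triplet size are the values quoted in Section 2.1. The vowel rates of 4 and 5 per second are an assumption, inferred from the quoted 4-5 bits/second overhead rather than stated in the report.

```python
PHONEMES_PER_SECOND = 12   # typical speech rate quoted in Section 2.1
BITS_PER_TRIPLET = 8       # phoneme + duration + pitch
GAIN_BITS_PER_VOWEL = 1    # the one-bit gain adjustment


def channel_rate(vowels_per_second):
    """Return (base rate, rate with per-vowel gain bits) in bits per second."""
    base = PHONEMES_PER_SECOND * BITS_PER_TRIPLET
    return base, base + GAIN_BITS_PER_VOWEL * vowels_per_second


for vowels in (4, 5):  # assumed vowel rates
    print(vowels, channel_rate(vowels))
# 4 (96, 100)
# 5 (96, 101)
```

The base rate of 96 bits per second is the "about 100 bits per second" figure, and the gain bit raises it by roughly 4 to 5 percent.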

2.4  Text-to-Speech 

During  this  year  we  interfaced  the  MITALK  unrestricted 
text-to-speech  system  to  the  diphone  synthesizer.  Input  to 
MITALK  is  a  typed  text  string,  and  its  final  output  is  a 
digitized  speech  waveform.  MITALK  exists  as  a  number  of 
different  programs  that  are  executed  one  at  a  time  in  sequence, 
with  communication  between  program  modules  occurring  via  ordinary 
text  files.  Thus  it  is  possible  to  examine  the  intermediate 
output  of  the  system  at  any  point  during  processing.  The  last 
program  in  the  sequence  is  a  synthesis-by-rule  module  which 
accepts  input  in  the  form  of  phonemes,  durations,  pitch,  and 
stress  levels.  This  input  is  similar  to  the  input  required  by 
our diphone synthesizer.

The major task involved was then to convert the MITALK
string of phonemes, durations, pitches and stress values into a
form compatible with our synthesizer. There are differences in
phoneme names, location of phoneme boundaries, scaling of
durations and pitches, and specification of certain phonemes in
certain contexts. (For details see QPR #7.)

2.5  Harmonic  Deviations 

As  outlined  in  the  proposal,  we  started  to  investigate  the 
feasibility  of  improving  the  spectral  representation  for  LPC 
synthesis  by  the  inclusion  of  the  deviation  of  each  spectral 
harmonic  from  the  smooth  LPC  spectral  model.  In  order  to  test 
out  the  principle  of  harmonic  deviations  we  used  it  in  an 
analysis-synthesis  (vocoder)  environment.  This  work  is  described 
in detail in QPR No. 5 [3].

The  basic  idea  used  in  the  Harmonic  Deviations  Vocoder  (HDV) 
is  as  follows.  The  transmitter  extracts  from  the  speech  signal 
and  sends  to  the  receiver  the  parameters  of  a  smooth  speech 
spectral  envelope  model  as  well  as  the  deviation  at  each  harmonic 
frequency  between  the  speech  spectrum  and  the  spectral  envelope 
model.  The  receiver  generates  a  pitch-period  of  the  excitation 
signal from a fixed frequency-dependent harmonic phase spectrum
and the harmonic amplitude spectrum computed from the transmitted
deviations.  The  excitation  signal  is  in  turn  applied  to  the 
filter  corresponding  to  the  spectral  envelope  model  to  produce 
the  output  speech. 

As  part  of  our  phonetic  synthesis  project  we  have  developed 
a floating-point non-real-time simulation of an analysis-synthesis
system that uses harmonic deviations. In this
simulation,  we  do  not  quantize  LPC  parameters,  and  we  employ  the 
first  15  harmonic  deviations.  We  extract  the  harmonic  deviations 
from  the  10-kHz  sampled  speech  once  every  10  ms.  Our  preliminary 
experiments  have  shown  that  the  addition  of  harmonic  deviations 
to  the  LPC  synthesis  significantly  increases  the  naturalness  and 
fullness  of  the  synthesized  speech  while  reducing  the  buzzy 
quality. 
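
The transmitter-side computation of harmonic deviations can be sketched as follows. This Python sketch is an illustration only: it assumes the harmonic magnitudes of the speech spectrum have already been measured, it expresses each deviation as a ratio in dB, and it uses the sign convention A(z) = 1 + sum a_k z^-k for the LPC inverse filter. None of these choices is taken from the report, and the function names are hypothetical. The 10-kHz sampling rate and the limit of 15 deviations are the values quoted above.

```python
import cmath
import math


def lpc_envelope(freq_hz, a, gain, fs_hz=10000.0):
    """|G / A(e^{jw})| with A(z) = 1 + sum_k a[k-1] z^-k (assumed convention)."""
    w = 2.0 * math.pi * freq_hz / fs_hz
    denom = 1.0 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
    return gain / abs(denom)


def harmonic_deviations_db(harmonic_mags, f0_hz, a, gain, fs_hz=10000.0, n_dev=15):
    """Deviation (dB) of each measured harmonic from the smooth LPC envelope."""
    devs = []
    for k, mag in enumerate(harmonic_mags[:n_dev], start=1):
        env = lpc_envelope(k * f0_hz, a, gain, fs_hz)
        devs.append(20.0 * math.log10(mag / env))
    return devs


# With a flat envelope (no predictor coefficients, unit gain) the deviations
# are simply the harmonic magnitudes in dB.
devs = harmonic_deviations_db([1.0, 10.0], 100.0, [], 1.0)
```

At the receiver the same envelope values, corrected by the transmitted deviations, would give the harmonic amplitude spectrum of the excitation; that step is not shown here.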

3.  PHONETIC  RECOGNITION

As  shown  in  Figure  1,  we  intend  to  use  the  output  of  a 
phonetic  recognizer  to  supply  the  input  to  the  currently 
available  diphone  synthesizer.  Therefore,  the  recognizer  must 
extract  from  an  input  speech  file  a  sequence  of  phonemes,  each 
having  its  own  duration  and  pitch.  We  have  considered  two  main 
approaches  to  recognition:  diphone  template  recognition,  and 
feature-based  acoustic-phonetic  recognition.  The  first  method 
compares  unknown  speech  with  a  large  inventory  of  diphone 
templates  by  the  use  of  a  distance  metric.  The  program  chooses 
as  its  answer  the  sequence  of  phonemes  corresponding  to  the 
sequence  of  diphone  templates  that  matches  the  input  speech  most 
closely  according  to  the  distance  metric.  The  second  method 
involves  the  use  of  a  set  of  analytical  rules  that  are  designed 
explicitly  to  make  some  phonetic  distinction.  Therefore,  these 
rules  can  each  use  specific  knowledge  pertaining  to  a  specific 
phonetic  distinction  -  rather  than  be  restricted  to  a  uniform 
metric.  The  disadvantage  of  using  acoustic-phonetic  rules  is 
that  it  is  hard  to  arrive  at  a  complete  set  of  rules  that  result 
in  consistent  scores.  Also,  it  has  been  our  experience  that 
mistakes  made  by  such  a  program  tend  to  be  more  severe. 
Therefore, most of our effort during this year has been placed on
the diphone template approach. We still believe, however, that
at some future date, acoustic-phonetic rules can be used to
disambiguate between particular choices suggested by the first
method.

The remainder of this section is organized as follows.
Section 3.1 contains an overview of the diphone template
recognition  method  that  we  use.  It  details  the  data  structures 
used  and  the  matching  algorithm  employed  and  discusses  the 
scoring  philosophy  that  motivated  some  of  the  design  decisions. 
In  Section  3.2,  we  enumerate  the  implementation  issues  that  have 
made  the  speed  and  storage  requirements  manageable.  Any 
recognition  system  requires  some  training  or  tuning  in  order  to 
perform  well.  The  different  modes  of  training  that  we  have 
implemented  and  envision  for  the  future  are  described  in  Section 
3.3.  Finally,  in  Section  3.4,  we  report  on  the  testing  and 
performance  of  the  initial  recognition  program,  as  well  as  its 
integration  into  the  phonetic  vocoder. 

3.1  Diphone  Template  Recognition 

A  great  deal  of  thought  has  gone  into  the  design  of  our 
initial  phonetic  recognition  system  using  diphone  templates. 
What we hope to communicate here are the kinds of issues that were
considered and how their consideration has influenced the design.
Also given is the basic structure of each component of the
phonetic recognizer as it has evolved over this first year of
research.

3.1.1  Motivation  for  Diphone  Template  Recognition 

The  issue  discussed  here  is  that  of  using  diphone  templates 
and  a  matcher  to  do  phonetic  recognition.  There  are  two  main 
factors  that  are  relevant  to  this  decision.  First,  as  in  the 
case  of  phonetic  synthesis,  using  the  diphone  as  a  recognition 
and  synthesis  unit  emphasizes  the  significance  of  the  phoneme 
transitions.  Accurately  modeling  the  phoneme  transitions  is 
important  for  both  recognition  and  synthesis.  Second,  we  know 
that  the  output  of  this  phonetic  recognizer  is  going  to  be  used 
in  conjunction  with  a  phonetic  synthesizer  that  also  uses  diphone 
templates. Therefore, there is a strong motivation not only to
use diphone templates in the recognizer but also to use the same
set of diphone templates: using identical templates for both
will, at least, guarantee that the synthesized phonemes are
spectrally close to the original.

3.1.2  Diphone  Network

The  method  of  phonetic  recognition  based  on  the  use  of  a 
network  of  diphone  templates  was  described  briefly  in  our 
proposal  [5].  Compiling  the  diphone  templates  into  a  network
implicitly  forces  the  matcher  to  consider  only  sequences  of 
diphones  that  are  consistent.  Since  writing  the  proposal, 
several  of  the  details  have  been  changed.  The  diphone  network 
consists  of  nodes  and  directed  arcs.  An  example  of  a  simple 
network  is  shown  in  Fig.  5.  There  are  two  major  types  of  nodes: 
phoneme  nodes  and  spectrum  nodes.  The  phoneme  nodes  (shown  as
labeled  circles)  correspond  to  the  midpoints  of  the  phonemes;
there  is  one  such  node  for  each  phoneme.  These  phoneme  nodes  are
connected  by  diphone  templates.  Each  diphone  template  is
represented  in  the  network  as  a  sequence  of  spectrum  nodes  (shown
as  dots).

Each  spectrum  node  consists  of  a  complete  spectral  template, 
as  well  as  a  probability  density  of  the  duration  of  the  node.  A 
spectral  template  is  represented  by  means  and  standard  deviations 
for  all  14  log-area-ratio  (LAR)  coefficients  and  gain.  The 
duration  of  a  node  is  defined  as  the  number  of  frames  of  input 
aligned  with  the  node.  Nodes  are  connected  to  each  other  by 
directed  paths.  When  two  or  more  consecutive  spectra  in  the
original  diphone  template  are  very  similar,  they  are  represented
by  a  single  spectrum  node  in  the  network.  Therefore,  each 
spectral  node  has  an  implicit  self  loop.  The  scoring  used  in  the 
matcher  when  this  self  loop  path  is  taken  depends  completely  on 
the  duration  probability  density  which  has  been  collected  during 
the  course  of  previous  training. 
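
As an illustration only, the contents of a spectrum node could be laid out as in the following Python sketch. The class name, the field names, the dictionary used for the duration density, and the small floor probability returned for unseen durations are all assumptions made for this example; none of them is taken from the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

NUM_LAR = 14  # 14 log-area-ratio coefficients per spectral template


@dataclass
class SpectrumNode:
    """Hypothetical layout of one spectrum node in the diphone network."""
    means: List[float]        # NUM_LAR LAR means followed by the gain mean
    std_devs: List[float]     # matching standard deviations
    # Duration density: number of aligned frames -> probability.
    duration_prob: Dict[int, float] = field(default_factory=dict)
    # Directed arcs to the following nodes (the self loop is implicit).
    successors: List["SpectrumNode"] = field(default_factory=list)
    is_phoneme_boundary: bool = False  # corresponds to an "open dot"

    def __post_init__(self) -> None:
        expected = NUM_LAR + 1
        if len(self.means) != expected or len(self.std_devs) != expected:
            raise ValueError(f"expected {expected} means and standard deviations")

    def duration_probability(self, frames: int, floor: float = 1e-6) -> float:
        """Probability that exactly `frames` input frames align with this node."""
        return self.duration_prob.get(frames, floor)


node = SpectrumNode(
    means=[0.0] * (NUM_LAR + 1),
    std_devs=[1.0] * (NUM_LAR + 1),
    duration_prob={1: 0.6, 2: 0.3, 3: 0.1},
)
print(node.duration_probability(2))
```

Keeping the duration density in the node itself means that the cost of taking the self loop can be looked up from the number of frames already aligned with the node, without any further spectral comparison.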

The  network  representation  as  it  has  been  described  so  far 
would  easily  permit  a  matcher  to  determine  diphone  durations. 
However,  since  the  synthesizer  requires  phoneme  durations,  we
explicitly  mark  each  diphone  path  in  the  network  at  the  point 
that  corresponds  to  its  phoneme  boundary.  The  open  dots  in  the 
figure  indicate  the  first  spectrum  node  in  the  original  diphone 
template  that  is  at  or  past  the  labeled  phoneme  boundary. 

Several  possible  types  of  network  structures  are  illustrated 
in  the  example  shown.  For  example,  in  Fig.  5  the  diphone 
template  P1-P2  is  distinct  from  the  template  P2-P1.  Also  note
the  possibility  of  diphones  of  the  type  P1-P1.  The  network
allows  for  two  or  more  templates  going  from  one  phoneme  to
another  (e.g.,  P2-P1).  Branching  and  merging  of  paths  within  a
template  is  also  allowed  (e.g.,  P1-P3).  The  network  allows  the
specification  of  diphones  in  context.  The  node  P4/&P3  represents
the  phoneme  P4  followed  only  by  P3.  Thus  the  template  P2-P4/&P3
can  be  different  from  the  unconditioned  template  P2-P4.  As
discussed  in  Section  2.2.3,  we  have  allowed  for  the  distinction
between  diphone  templates  that  are  within  a  syllable  and  those
that  cross  a  syllable  boundary.  In  the  figure,  one  of  the
diphone  templates  from  P2  to  P1  is  marked  as  being  across  a
syllable  boundary.  This  syllable  boundary  is  indicated  on  the
spectral  node  marked  as  a  phoneme  boundary  (shown  as  an
additional  circle  around  the  open  dot).  Finally,  we  allow  for
the  representation  and  compilation  of  consonant  clusters  as  a
single  unit,  even  though  the  cluster  may  be  several  phonemes
long.  For  example,  the  initial  cluster  [S-SIT-T-R]  as  in
"string"  is  acoustically  different  from  the  concatenated
diphones.  Therefore,  the  sequence  is  compiled  as  a  complete
unit.  This  means  that  the  intermediate  phoneme  nodes  (-SIT-T-)
are  used  only  in  this  sequence,  and  they  are  not  shared  by  other
diphones.  Thus,  there  is  no  branching  at  these  nodes.  In  the
example,  the  cluster  [P1-P5-P3]  is  shown  as  a  separate  path.  The
phoneme  node  named  P5*  indicates  one  of  these  unshared
intermediate  nodes.

Diphone  Network  Compiler 

The  program  that  creates  the  diphone  template  network 
(NETWORK)  takes  as  input  a  text  file  similar  to  that  read  by  the 
synthesis  diphone  template  compiler  (COMPOZ).  Like  COMPOZ,  this
compiler  also  permits  the  inclusion  of  incremental  changes.  That
is,  we  can  replace  diphone  paths,  or  add  additional  optional 
paths  to  the  existing  binary  file  without  rerunning  the  compiler 
on  the  original  diphone  templates.  Not  only  does  this  preserve 
program  compatibility  and  eliminate  the  necessity  for  redundant 
representation  of  the  same  information  but  it  also  greatly 
reduces  the  amount  of  time  necessary  to  produce  a  new  large 
network  if  it  is  only  slightly  different  from  a  previously 
compiled  one. 

Another  feature  of  the  format  of  the  information  compiled  is 
that  statistical  information  can  later  be  added  by  the  matcher 
and  be  available  for  subsequent  incremental  additions.  Both  of 
the  above  capabilities  were  facilitated  by  the  design  and  use  of 
a  memory  management  package.  Thus  memory  freed  by  the  release  of 
a  block  of  data  structure  is  available  for  later  use.  The  memory 
requirements  for  the  network  and  matcher  will  be  discussed 
further  in  Section  3.2.2. 

3.1.3  Matcher 

Briefly,  the  analyzer  chooses  the  sequence  of  templates  that 
best  matches  the  input  speech  according  to  a  distance  measure. 
Since  the  received  speech  is  spectrally  close  to  the  original,  it 
is  hoped  that  this  procedure  will  suffer  minimally  from  phoneme
recognition  errors.  The  network  matcher  uses  a  dynamic
programming  algorithm  which  attempts  to  find  the  sequence  of
templates  in  the  network  that  best  matches  the  input.

The  design  of  the  matcher  has  been  strongly  motivated  by  the
following  four  considerations:

a)  Sound  Scoring  Strategy 

b)  Continuous  operation 

c)  Alignment  and  training  capability 

d)  Efficiency 

Of  these  four  considerations,  the  first  is  perhaps  the  most
important.  We  feel  that  the  scoring  procedure  should  implement
(as  accurately  as  possible)  a  well  formulated  scoring  strategy.
The  scoring  philosophy  requires  that  the  score  have  components
that  are  derived  from  a  knowledge  of  the  particular  path  chosen
through  the  network  (a  priori),  the  amount  of  speech  aligned  with
each  particular  spectral  template  in  the  network  (durations),  and
the  score  derived  from  a  spectral  comparison  between  the  input
spectrum  and  the  template  spectrum.  The  derivation  of  the
scoring  strategy  is  discussed  in  detail  in  QPR  #6,  Section  3.1
[7].

A  second  major  consideration  is  that  the  matcher  operate 
continuously,  producing  its  output  as  it  receives  its  input  -
with  some  delay  -  before  acquiring  all  of  the  input.  Thus  the
matcher  can  be  thought  of  as  producing  output  as  soon  as  it  has 
what  it  thinks  is  sufficient  evidence  to  make  a  conclusion. 
Operated  in  this  mode,  further  input,  regardless  of  what  it  is, 
cannot  cause  the  matcher  to  change  previously  produced  output. 
This  is  important  because  of  the  intended  use  of  the  recognizer 
in  a  real-time  vocoder  application. 

The  third  consideration  is  that  the  user  be  able  to  make  the
matcher  find  the  best  alignment  for  the  "correct"  phoneme
sequence  and  then  use  that  aligned  input.  This  is  important  for
the  continuing  training  of  the  recognition  system,  as  well  as  for
debugging  the  recognizer.  These  capabilities  will  be
discussed  further  in  Section  3.3.

The  final  consideration  of  efficiency  has  affected  the 
design  of  the  network  and  matcher  in  many  ways.  This  is 
discussed  further  in  Section  3.2.1. 

3.1.4  Search  Algorithm 

The  purpose  of  the  matcher  is  to  find  the  single  path 
through  the  network  that  best  matches  a  sequence  of  input  frames. 
The  program  keeps  a  list  of  partial  "theories"  for  the  best  path. 
A  theory  consists  of  a  detailed  account  of  how  a  sequence  of
input  frames  is  aligned  with  the  network,  along  with  a  total
score  for  that  correspondence.  For  each  input  frame,  the  matcher
updates  each  of  the  theories  by  the  addition  of  the  new  frame.
This  is  illustrated  in  Fig.  6.  Each  theory  spawns  at  least  three
new  theories:  one  that  corresponds  to  the  new  input  frame  being 
aligned  with  the  same  node  as  the  previous  frame,  one  for  each  of 
the  nodes  immediately  following  that  most  recent  node,  and  one 
for  each  possible  pair  of  two  successive  spectral  nodes.  Thus, 
in  the  example  shown,  the  single  initial  theory  (shown  by  the 
vertical  arrow)  would  result  in  13  new  theories  -  one  at  each 
dot.  After  all  theories  have  been  extended,  if  two  or  more  of 
the  new  theories  arrive  at  the  same  network  node  on  the  same 
input  frame,  only  the  best  scoring  path  is  kept.  Keeping  only 
the  best  scoring  path  constitutes  our  dynamic  programming 
algorithm.  The  remaining  theories  are  then  pruned  back  so  that 
only  the  best  are  kept.  There  are  two  parameters  that  determine 
how  many  theories  are  kept  at  each  step.  The  first  is  an 
absolute  maximum  number  of  theories  (stack  length).  The  second 
is  the  maximum  score  difference  between  the  best  theory  and  the 
worst  theory  kept.  If  the  maximum  score  difference  is  set  to 
infinity,  then  the  search  is  a  "bounded  breadth  search".  If  the
stack  length  is  set  to  infinity,  the  algorithm  is  called  a  "beam
search".  The  most  reasonable  tradeoff  between  speed  and  accuracy
is  achieved  when  using  both  of  these  methods  of  pruning.
Typically,  the  number  of  theories  kept  is  on  the  order  of  1000 
per  frame. 
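
The interaction of the two pruning parameters can be illustrated with a short Python sketch. It is a simplification under stated assumptions: scores are taken to be "higher is better", the names Theory and prune are hypothetical, and the full list is sorted for clarity, which is not how the program itself keeps its theories (see Section 3.2.1).

```python
import math
from typing import List, NamedTuple, Optional


class Theory(NamedTuple):
    score: float   # assumed: higher is better (e.g. a log probability)
    node_id: int   # network node aligned with the newest input frame


def prune(theories: List[Theory],
          stack_length: Optional[int],
          max_score_diff: float) -> List[Theory]:
    """Keep at most `stack_length` theories whose score lies within
    `max_score_diff` of the best one. `stack_length=None` means no limit."""
    if not theories:
        return []
    ranked = sorted(theories, key=lambda t: t.score, reverse=True)
    best = ranked[0].score
    within_beam = [t for t in ranked if best - t.score <= max_score_diff]
    return within_beam[:stack_length]


candidates = [Theory(-1.0, 1), Theory(-2.5, 2), Theory(-9.0, 3), Theory(-1.5, 4)]

# Bounded breadth search: score difference unlimited, list length limited.
print(prune(candidates, stack_length=2, max_score_diff=math.inf))
# Beam search: list length unlimited, score difference limited.
print(prune(candidates, stack_length=None, max_score_diff=2.0))
```

With both limits active, the beam removes theories that are clearly worse than the best one, while the stack length caps the work done on frames where many theories score about the same.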

To  decide  on  the  phonemes  to  transmit,  the  program  examines 
all  the  remaining  theories  after  updating  each  theory  with  the 
newest  input  frame.  If  all  the  theories  have  a  common  beginning 
(in  terms  of  exactly  the  same  alignment  of  the  input  utterance 
with  the  network),  the  phonemes  and  durations  corresponding  to 
this  common  beginning  are  transmitted.
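
The test for a common beginning amounts to finding the longest shared prefix of the surviving alignments. The Python sketch below shows the idea; representing an alignment as a list of (frame, node) pairs is an assumption made for the example, and the program itself tracks the "genealogy" of theories rather than comparing complete lists.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding of one alignment step: (input frame index, node id).
Step = Tuple[int, int]


def common_beginning(alignments: Sequence[Sequence[Step]]) -> List[Step]:
    """Return the longest prefix shared by every alignment."""
    if not alignments:
        return []
    prefix: List[Step] = []
    for steps in zip(*alignments):
        # Stop at the first frame on which any two theories disagree.
        if any(step != steps[0] for step in steps[1:]):
            break
        prefix.append(steps[0])
    return prefix


theory_a = [(0, 10), (1, 10), (2, 11), (3, 12)]
theory_b = [(0, 10), (1, 10), (2, 11), (3, 13)]
theory_c = [(0, 10), (1, 10), (2, 14), (3, 15)]

print(common_beginning([theory_a, theory_b, theory_c]))
# Once theory_c is pruned, one more frame becomes unambiguous.
print(common_beginning([theory_a, theory_b]))
```

Any phoneme nodes crossed inside the shared prefix can be output at once, because no later input can change that part of the alignment.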

Once  a  phoneme  and  its  duration  have  been  determined,  the 
pitch  is  calculated  using  the  weighted  least  squares  method 
currently  used  to  determine  pitch  values  for  our  diphone 
synthesizer.
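
As a rough illustration of a weighted least squares fit, the Python sketch below fits a straight line to the pitch values of the frames within one phoneme. The choice of weights is not specified here; in the sketch they are hypothetical per-frame confidences, and the function name and the line model are likewise assumptions.

```python
from typing import Sequence, Tuple


def weighted_line_fit(times: Sequence[float],
                      values: Sequence[float],
                      weights: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) minimizing sum(w * (v - slope*t - intercept)**2)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    mean_t = sum(w * t for w, t in zip(weights, times)) / total
    mean_v = sum(w * v for w, v in zip(weights, values)) / total
    spread = sum(w * (t - mean_t) ** 2 for w, t in zip(weights, times))
    if spread == 0:
        # All the weight sits on one time instant: fall back to a flat line.
        return 0.0, mean_v
    slope = sum(w * (t - mean_t) * (v - mean_v)
                for w, t, v in zip(weights, times, values)) / spread
    return slope, mean_v - slope * mean_t


# Frame 2 carries an obviously bad pitch estimate and is given zero weight.
frames = [0.0, 1.0, 2.0, 3.0]
pitch_hz = [100.0, 102.0, 999.0, 106.0]
confidence = [1.0, 1.0, 0.0, 1.0]
print(weighted_line_fit(frames, pitch_hz, confidence))
```

Weighting lets unreliable frames contribute little or nothing to the fitted pitch contour.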

3.2  Program  Efficiency  Issues 

As  mentioned  above,  efficient  operation  of  the  program  is  an 
important  issue. 

3.2.1  Speed 

While  the  simulation  is  not  expected  to  run  in  real  time, 
speed  is  still  important.  If  the  program  takes  several  minutes
of  CPU  time  to  get  a  result  for  one  sentence,  then  under  a
reasonable  computer  load  average  we  must  wait  one  half  hour  to
see  the  result.  This  reduces  the  number  of  variations  we  can
try.

From  the  start,  the  program  was  designed  to  run  as  fast  as 
possible.  However,  the  matcher  algorithm  outlined  above  requires 
a  large  amount  of  processing.  For  each  10  ms  frame,  the  program 
must  extend  each  of  about  1000  theories,  update  them  by  walking 
0,  1  or  2  steps  through  the  network  (resulting  in  about  10,000 
theories),  score  each  of  these  new  nodes  against  the  input  frame, 
keep  a  list  of  theories  sorted  by  score,  and  keep  track  of  the
"genealogy"  of  all  the  theories  so  that  it  is  known  when  all 
theories  agree.  Below,  we  list  some  of  the  search,  sorting,  and 
dynamic  programming  techniques  that  were  employed  to  reduce  the 
computation  by  about  4  orders  of  magnitude  from  that  required  for 
the  straightforward  implementation. 

Theory  Merging 

Since  the  program  extends  all  theories  by  the  addition  of 
the  same  input  frame  at  one  time,  all  theories  at  all  times 
include  the  same  input.  This  means  that  the  probabilistic  scores 
on  different  theories  are  directly  comparable.  Therefore,  the 
diphone  network  was  designed  such  that  if  two  theories  merge
(i.e.,  arrive  at  the  same  node  for  the  same  input  frame)  only  the
best  scoring  theory  is  considered.  When  a  theory  is  advanced,
the  program  checks  a  word  in  the  new  network  node  to  see  whether 
this  node  has  been  reached  by  another  theory  in  this  frame.  This 
word  points  to  the  previous  theory  that  arrived  there.  The  two 
theories  can  be  immediately  compared  (based  only  on  previous 
scores  -  without  any  spectral  scoring  at  this  point)  and  the 
lower  scoring  theory  deleted.  Thus,  the  program  need  not  create 
all  the  data  structure  and  score  and  sort  10,000  extended 
theories.  Rather  it  is  more  like  3000.  After  all  theories  have 
been  extended,  the  program  then  can  score  and  sort  the  remaining 
theories. 
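
The merging step can be sketched in Python as follows. A dictionary stands in for the word stored in each network node, and all names are hypothetical. The sketch keys the comparison on the node together with the number of frames already aligned with it, so that theories newly arriving at a node compete with each other while theories that have looped a different number of times stay distinct; this reading is an assumption chosen to be consistent with the discussion of template scoring, not a description of the actual data structures.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Extended(NamedTuple):
    prior_score: float    # score before the new frame is spectrally scored
    node_id: int          # node reached on the current input frame
    frames_on_node: int   # frames aligned with that node so far
    label: str            # stands in for the rest of the theory's history


def merge_theories(extended: Iterable[Extended]) -> List[Extended]:
    """Keep only the best-scoring theory for each (node, duration) pair."""
    best: Dict[Tuple[int, int], Extended] = {}
    for theory in extended:
        key = (theory.node_id, theory.frames_on_node)
        incumbent = best.get(key)
        # Compared on prior scores only; no spectral distance is needed yet.
        if incumbent is None or theory.prior_score > incumbent.prior_score:
            best[key] = theory
    return list(best.values())


new_theories = [
    Extended(-3.0, 7, 1, "a"),
    Extended(-2.0, 7, 1, "b"),   # beats "a": same node, same duration
    Extended(-5.0, 7, 2, "c"),   # kept: has looped once more on node 7
    Extended(-4.0, 8, 1, "d"),
]
print([t.label for t in merge_theories(new_theories)])
```

Because the losing theory is discarded before any spectral scoring, the later scoring and sorting steps see only the survivors.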

Template  Scoring

Since  an  important  part  of  the  score  is  based  on  the  number 
of  input  frames  that  match  (are  aligned  with)  a  single  spectral 
node,  theories  that  differ  in  this  respect  must  be  kept  distinct. 
This  results  in  the  same  spectral  node  being  matched  against  the 
same  input  frame  several  times.  For  example,  one  theory  may  have 
looped  on  a  particular  node  two  times,  and  another  three  times. 
When  these  two  theories  are  both  advanced  by  self-looping  again, 
the  program  detects  that  this  particular  score  has  already  been 
computed.  Rather  than  keeping  a  large  matrix  of  scores  as  a
function  of  node  and  input  frame  pair  or,  alternatively,  clearing
the  score  each  time  it  was  used,  the  program  uses  a  slot  in  each
spectral  node  reserved  for  this  purpose.  When  the  spectral
distance  routine  scores  this  node  against  an  input  frame,  it 
records  the  score  and  a  time  stamp  in  the  node.  If  the  spectral 
distance  routine  notices  that  the  time  stamp  in  the  node  matches 
the  current  time  stamp,  then  it  can  merely  use  the  stored  score 
and  avoid  a  distance  calculation. 
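
The time stamp technique can be sketched in Python as follows. The class and function names are hypothetical, and the distance shown (a sum of squared differences normalized by the standard deviations) is only a placeholder for the actual spectral score, whose derivation is given in QPR #6.

```python
from typing import Sequence


class ScoredNode:
    """Hypothetical spectrum node carrying a one-entry score cache."""

    def __init__(self, means: Sequence[float], std_devs: Sequence[float]) -> None:
        self.means = list(means)
        self.std_devs = list(std_devs)
        self.cached_score = 0.0
        self.cached_stamp = -1   # frame index for which cached_score is valid
        self.evaluations = 0     # counts real distance computations


def spectral_distance(node: ScoredNode, frame: Sequence[float]) -> float:
    """Placeholder distance: variance-normalized squared difference."""
    node.evaluations += 1
    return sum(((x - m) / s) ** 2
               for x, m, s in zip(frame, node.means, node.std_devs))


def score_node(node: ScoredNode, frame: Sequence[float], frame_index: int) -> float:
    """Reuse the stored score when the node was already scored on this frame."""
    if node.cached_stamp != frame_index:
        node.cached_score = spectral_distance(node, frame)
        node.cached_stamp = frame_index
    return node.cached_score


node = ScoredNode(means=[0.0, 1.0], std_devs=[1.0, 2.0])
print(score_node(node, [1.0, 3.0], frame_index=5))   # computed
print(score_node(node, [1.0, 3.0], frame_index=5))   # taken from the slot
print(node.evaluations)
```

Since the stamp changes with every input frame, the stored score never has to be cleared explicitly; a stale entry is simply overwritten the next time the node is scored.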

Theory  Sorting 

As  mentioned  above,  the  program  needs  to  know  how  to  keep 
only  the  best  1000  or  so  theories.  One  way  to  do  this  would  be 
to  sort  the  3000  or  so  extended  theories,  and  then  just  keep  the 
best  ones.  However,  since  simple  sort  routines  require  computation
proportional  to  the  square  of  the  length  of  the  list,  the  program 
uses  another  method.  It  uses  a  partially  sorted  binary  tree  of 
theories.  The  head  of  the  tree  always  contains  the  lowest 
scoring  theory.  So  a  newly  extended  theory  can  be  compared 
directly  against  this  theory  to  decide  whether  it  will  be 
retained.  If  so,  then  the  new  theory  ripples  down  the  binary 
tree  until  it  finds  the  correct  spot.  This  requires  only  Log2(N)
comparisons.  Thus,  for  N=1000,  the  savings  is  about  1000/Log2(1000)  =  100.
Timing  statistics  have  shown  us  that  the  time  required  for  this 
binary  partial  sort  is  now  small  compared  to  the  other  processing 
required  in  the  recognizer. 
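
A partially sorted binary tree of this kind behaves like a bounded min-heap. The Python sketch below uses the standard heapq module in place of the program's own tree; the function name and the (score, theory id) tuples are assumptions made for the example, and scores are taken to be "higher is better".

```python
import heapq
from typing import Iterable, List, Tuple

ScoredTheory = Tuple[float, int]   # (score, hypothetical theory id)


def keep_best(theories: Iterable[ScoredTheory], capacity: int) -> List[ScoredTheory]:
    """Retain the `capacity` highest-scoring theories.

    The heap root (index 0) is always the lowest-scoring theory kept, so a
    new theory is accepted or rejected with a single comparison, and an
    accepted one ripples down in about log2(capacity) steps.
    """
    heap: List[ScoredTheory] = []
    for theory in theories:
        if len(heap) < capacity:
            heapq.heappush(heap, theory)
        elif theory[0] > heap[0][0]:
            # Drop the current worst theory and insert the new one.
            heapq.heapreplace(heap, theory)
    return heap


kept = keep_best([(-5.0, 0), (-1.0, 1), (-3.0, 2), (-0.5, 3), (-9.0, 4)], capacity=3)
print(kept[0])                      # the worst theory still kept
print(sorted(kept, reverse=True))   # full ordering only if it is needed
```

The list is never fully sorted; only the position of the worst retained theory is guaranteed, which is all that the pruning decision requires.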

Spectral  Distance  Computation

A  significant  portion  of  the  computation  was  expected  to  be 
in  the  spectral  distance  computation.  In  addition  to  minimizing 
the  number  of  times  that  the  routine  is  called,  we  would  like  to 
make  the  routine  itself  as  fast  as  possible.  First,  the 
parameters  are  stored  in  fixed  point  format.  (We  did  not  lose 
much  accuracy,  since  the  parameters  were  scaled  up  before 
fixing.)  Second,  a  careful  examination  of  the  compiled  BCPL 
routine  revealed  that  the  routine  could  be  hand  coded  in  assembly 
language  to  be  4  times  faster. 

At  this  point,  timing  measurements  indicate  that  less  than 
10%  of  the  total  CPU  time  is  taken  by  the  spectral  distance 
routine. 

Beam  Search  vs.  Bounded  Breadth  Search

While  the  bounded  breadth  search  described  above  is 
efficient,  there  are  times  when  the  difference  between  the 
highest  score  and  the  lowest  score  is  very  large.  If  we  put  an 
upper  bound  on  this  difference,  then  frequently  most  of  the 
theories  can  be  eliminated.  One  desirable  feature  of  this  "beam 
search"  is  that  the  number  of  theories  kept  grows  when  the 
decision  is  not  clear  cut.  Also,  it  becomes  possible  to  estimate 
(and  control)  the  probability  that  the  program  will  accidentally
eliminate  a  theory  that,  if  kept,  would  have  been  the  best  one.

Future  Speed-Up

At  this  point,  the  program  requires  CPU  time  of  about  20-30 
times  real  time  for  a  stack  long  enough  to  assure  finding  a  path 
similar  to  the  best  path.  Since  the  time  is  not  spent  in  the 
sorting  or  scoring,  it  will  be  hard  to  make  large  speed 
improvements.  The  bulk  of  the  time  required  is  probably  taken  up
[...]  overnight  batch  run.  Also,  we  can  run  the  program  comfortably
during  the  day  to  test  and  debug  new  features. 

3.2.2  Storage  Requirements 

In  order  that  the  program  operate  quickly,  the  entire 
network  must  reside  in  (virtual)  memory.  If  the  program  had  to 
read  pieces  of  the  network  from  disk  files  as  it  ran,  it  would  be 
several  times  slower.  Therefore,  a  concerted  effort  was  made  to 
keep  the  data  structures  small,  and  share  memory  where  possible. 

The  initial  diphone  template  network  was  compiled  from  the 
2800  or  so  synthesis  diphone  templates.  The  variable-frame-rate 
compression  described  in  Section  3.1.2  reduced  the  network  by  a
factor  of  2.5.  The  spectral  parameters  were  quantized  to  9  bit
fixed  point  values  to  conserve  space  as  well  as  time.  Pointers 
in  the  network  structure  were  packed  two  per  word.  Even  so,  this 
initial  network  just  barely  fit  in  memory  (256K  words)  with  the 
matcher  program.  We  expect  that  the  final  network  will  contain 
several  instances  of  each  diphone.  This  network  could  not 
possibly  (even  with  more  compression)  fit  in  one  PDP10  address 
space.
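
The two space-saving measures can be illustrated in Python. The sketch assumes the 36-bit PDP-10 word with two 18-bit halves for the pointer packing; the signed representation, the clamping, and the scale factor used for the 9-bit quantization are assumptions made for the example, as are the function names.

```python
from typing import Tuple

HALF_BITS = 18                       # half of a 36-bit word
HALF_MASK = (1 << HALF_BITS) - 1     # 262143, the largest half-word pointer


def pack_pointers(left: int, right: int) -> int:
    """Pack two half-word pointers into one word."""
    if not (0 <= left <= HALF_MASK and 0 <= right <= HALF_MASK):
        raise ValueError("pointer does not fit in a half word")
    return (left << HALF_BITS) | right


def unpack_pointers(word: int) -> Tuple[int, int]:
    """Recover the two half-word pointers from a packed word."""
    return (word >> HALF_BITS) & HALF_MASK, word & HALF_MASK


def quantize(value: float, scale: float, bits: int = 9) -> int:
    """Scale a parameter up, round it, and clamp it to a signed fixed point range."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, round(value * scale)))


word = pack_pointers(5, 262143)
print(unpack_pointers(word))
print(quantize(1.5, scale=64.0), quantize(10.0, scale=64.0))
```

A half-word pointer can address only one 256K-word address space, which is why a network larger than that cannot keep this packing.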

We  do  not  view  this  as  a  significant  problem  for  eventual 
implementation,  since  large  memory  chips  are  rapidly  becoming 
inexpensive.  Also,  we  shall  soon  be  using  a  DEC  VAX  11/780 
computer,  which  has  a  much  larger  address  space.  The  personal 
computers  being  designed  at  BBN  also  feature  a  large  virtual 
memory. 

As  a  temporary  solution  to  this  problem,  we  have  modified 
the  BCPL  compiler,  our  memory  management  package,  and  the  user 
programs  in  order  to  take  advantage  of  the  extended  memory 
capability  of  the  KL10-TOPS20-Release  4  monitor.  This  currently
allows  us  to  use  up  to  32  full  address  spaces  for  our  program  and 
data.  Since  the  program  uses  the  BCPL  "structure"  declarations, 
switching  from  half-word  pointers  to  full-word  pointers  did  not 
require  as  many  source  code  changes  as  it  would  otherwise  have
required.

3.3  Training 

Each  diphone  can  be  spoken  in  a  variety  of  ways.  Therefore, 
to  correctly  recognize  these  many  variations  as  the  same  phoneme 
string,  the  program  must  be  trained  with  large  amounts  of  natural
speech.  There  are  three  different  ways  that  we  have  of  training 
the  network. 

3.3.1  Manual  Training 

The  most  straightforward  approach  to  training  the  network  on 
a  large  variety  of  speech  is  simply  to  label  (manually 
transcribe)  more  speech.  This  transcribed  speech  can  then  be 
processed  by  existing  programs  to  add  alternate  paths  to  the 
network.

This  method  has  the  advantage  of  simplicity.  There  are 
several  disadvantages,  however.  First,  each  instance  of  a 
particular  diphone  template  would  be  independent.  Therefore,  the 
statistical  information  (a  priori  path  probabilities,  LAR 
standard  deviations,  duration  probability  densities)  for  each 
template  would  be  determined  from  rules  rather  than  training.
That  is,  rather  than  observing  several  instances  of  the  same
diphone  and  using  the  accumulated  data  to  estimate  probability
densities,  the  program  would  treat  each  instance  as  an
independent  sample,  and  derive  a  Gaussian  distribution  for  each 
parameter  using  a  fixed  standard  deviation.  Second,  for  those 
diphones  that  are  common,  there  might  be  hundreds  of  examples 
before  the  average  number  of  templates  per  diphone  was  high 
enough.  Third,  the  effort  required  to  accurately  label  several 
hundred  sentences  is  tremendous.  Fourth,  this  method  requires 
that  the  researcher  who  transcribes  the  speech  know  the  best  way 
to  transcribe  each  passage  (one  must  often  choose  between  similar
labels)  and  the  best  rules  for  placement  of  phoneme  boundaries, 
as  well  as  be  able  to  use  these  rules  consistently.  These 
problems  could  significantly  affect  performance. 
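
The difference between a rule-based standard deviation and one estimated from accumulated data can be illustrated with a small Python sketch. It uses Welford's running-variance update, which is a choice made for the example rather than the method of the training programs; the class name and the fallback rule for a single sample are also assumptions.

```python
import math


class ParameterStats:
    """Running mean and standard deviation for one template parameter."""

    def __init__(self, fixed_std: float) -> None:
        self.count = 0
        self.mean = 0.0
        self._sum_sq_dev = 0.0       # sum of squared deviations from the mean
        self.fixed_std = fixed_std   # rule-based value used until data accumulates

    def add(self, sample: float) -> None:
        """Fold one more observed instance into the statistics."""
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self._sum_sq_dev += delta * (sample - self.mean)

    def std(self) -> float:
        """Estimated standard deviation, or the fixed value for fewer than two samples."""
        if self.count < 2:
            return self.fixed_std
        return math.sqrt(self._sum_sq_dev / (self.count - 1))


stats = ParameterStats(fixed_std=0.5)
stats.add(2.0)
print(stats.mean, stats.std())   # one instance: the fixed value is used
stats.add(4.0)
stats.add(6.0)
print(stats.mean, stats.std())   # several instances: estimated from data
```

With purely manual training every template stays in the single-sample case, which is the first disadvantage noted above.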

Clustering  of  Diphone  Templates

The  first  and  second  problems  mentioned  above  can  be 
eliminated  by  writing  a  clustering  program  that  would  start  with 
the  many  instances  of  one  diphone,  and  group  them  such  that  there 
were  only  a  few  templates  for  each  diphone,  with  statistics 
derived  from  the  data.  However,  the  third  and  fourth  problems 
(large  amount  of  work,  decisions  made  by  humans)  would  still 
exist. 


3.3.2  Interactive  Training 

One  of  the  capabilities  of  the  matcher  is  that  the  user  can
specify  what  answer  it  should  get  for  a  particular  input  speech
file.  The  researcher  types  in  a  list  of  phonemes  that  will  be 
required.  If  any  of  the  phonemes  is  in  question  (e.g.  AX  vs.  AH) 
then  the  researcher  can  just  type  "ANY"  which  allows  the  program 
to  use  the  closest  phoneme  at  this  point.  The  user  can  also,  if 
desired,  constrain  the  time  that  the  matcher  assigns  to  the 
beginning  of  any  phoneme.  This  is  done  in  a  flexible  manner, 
using  an  interactive  command  loop.  The  user  constrains  the 
boundary  to  be  one  of:  before  t,  after  t,  at  t,  or  between  t1
and  t2.  This  "forced  alignment"  can  also  be  read  in  from  a 
standard  label  file,  so  if  the  training  sentence  has  already  been 
transcribed,  the  user  need  not  type  in  the  required  phoneme 
string  and  times.  The  user  is  provided  with  an  editing  facility 
for  these  lists  so  that  he  may  insert,  delete,  or  replace  items. 
The  user  can  save  alignments  on  a  file  and  later  read  them  in. 
There  are  also  global  commands  that  may  be  used  to  give  the 
matcher  just  the  right  amount  of  flexibility. 

Two  likely  modes  of  interaction  are  outlined  below.  In  the 
first,  the  researcher  has  not  carefully  transcribed  the  training 
utterance.  By  listening  to  it  briefly,  he  makes  a  list  of  the 
phonemes  (leaving  out  those  about  which  he  is  unsure).  Then,  the 
matcher  is  instructed  to  find  the  best  alignment  of  the  input
utterance  to  the  network  under  the  constraint  of  the  phoneme
list.  The  user  compares  the  printed  alignment  visually  with 
parameter  plots  of  the  input  utterance.  If  the  alignment  is 
clearly  wrong  (a  fairly  rare  occurrence)  he  constrains  the  time 
for  the  beginning  of  one  or  two  of  the  phonemes,  and  tells  the 
matcher  to  "do  it  again". 

A  second  likely  mode  is  that  someone  has  already  transcribed 
the  utterance.  The  researcher  reads  in  the  transcription  and 
then,  being  skeptical,  instructs  the  program  to  "fuzz"  the  time 
constraints  by  two  frames  in  each  direction,  so  that  if  the 
transcriber  was  off  by  two  frames,  the  matcher  could  still  find 
the  correct  alignment.  He  then  changes  any  questionable  phoneme 
labels  to  "ANY"  and  instructs  the  matcher  to  find  the  best 
(constrained)  alignment. 
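
The time constraints and the "fuzz" operation can be represented quite simply, as the following Python sketch suggests. The class and function names are hypothetical, times are counted in frames, and treating "before t" and "after t" as inclusive of t is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BoundaryConstraint:
    """Allowed frame range for the beginning of one phoneme."""
    earliest: Optional[int] = None   # None means unconstrained on that side
    latest: Optional[int] = None

    def allows(self, frame: int) -> bool:
        if self.earliest is not None and frame < self.earliest:
            return False
        if self.latest is not None and frame > self.latest:
            return False
        return True

    def fuzz(self, frames: int) -> "BoundaryConstraint":
        """Loosen the constraint by `frames` in each direction."""
        earliest = None if self.earliest is None else max(0, self.earliest - frames)
        latest = None if self.latest is None else self.latest + frames
        return BoundaryConstraint(earliest, latest)


def before(t: int) -> BoundaryConstraint:
    return BoundaryConstraint(latest=t)


def after(t: int) -> BoundaryConstraint:
    return BoundaryConstraint(earliest=t)


def at(t: int) -> BoundaryConstraint:
    return BoundaryConstraint(earliest=t, latest=t)


def between(t1: int, t2: int) -> BoundaryConstraint:
    return BoundaryConstraint(earliest=t1, latest=t2)


# A transcribed boundary at frame 40, loosened by two frames each way.
loosened = at(40).fuzz(2)
print(loosened.allows(38), loosened.allows(42), loosened.allows(43))
```

During a constrained search, the matcher would then discard any theory that places a phoneme boundary on a frame its constraint does not allow.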

Once  the  user  is  satisfied  with  part  or  all  of  the 
alignment,  he  can  instruct  the  matcher  to  "train"  the  network 
using  the  training  utterance.  The  basic  commands  available  are:
Add  to  Statistics,  Add  new  path,  Add  new  diphone,  Save  the
[...]  duration  probability  densities.  The  second  command  (Add  path)
causes  the  program  to  add  in  the  specified  speech  as  an  alternate 
branch  to  the  section  of  network  against  which  it  was  aligned. 
If  the  spanned  network  contains  a  phoneme  boundary,  the  program 
asks  which  frame  of  the  input  should  be  marked  as  a  boundary. 
The  third  command  (Add  diphone)  lets  the  user  specify  a  whole  new 
template  for  a  named  diphone. 

Using  the  training  commands  described,  the  user  actually 
changes  the  diphone  template  network.  If  he  is  unsure  about  or 
not  satisfied  with  a  part  of  the  alignment,  he  need  not  use  that 
part  for  training.  At  any  point  in  the  process,  the  user  can 
save  the  updated  network  on  a  file. 

It  is  hoped  that  this  procedure,  which  amounts  to 
interactive  clustering,  might  yield  a  set  of  templates  that  is 
somehow  more  self-consistent.  One  drawback  to  this  approach  is 
that  it  is  a  very  slow  process,  requiring  the  researcher  to  spend 
long  hours  correcting  the  matcher  results. 

3.3.3  Automatic  Training 

One  could  easily  imagine  a  range  of  modes  of  training  in 
which  the  computer  can  make  more  of  the  decisions.  For  instance, 
once  the  researcher  has  achieved  the  desired  alignment,  he  could
issue  a  command  "Train"  which  would  either  add  to  the  network
statistics  or  add  new  paths  based  on  a  simple  local  score
threshold.

3.4  Testing  and  Performance

[...]  phonetic  recognizer,  we  needed  to  test  it  in  a  reasonable
environment.  This  necessitated  integrating  the  recognizer  with 
the  phonetic  synthesizer  to  complete  the  phonetic  vocoder.  We 
also  knew  that  the  system  would  not  perform  well  without  being 
trained.  These  two  efforts  are  outlined  below. 

3.4.1  Complete  Phonetic  Vocoder 

To  communicate  with  the  phonetic  synthesizer,  the  recognizer 
must  produce  a  sequence  of  triplets,  each  comprising  a  phoneme,  a 
duration,  and  a  pitch  value.  As  described  in  Section  3.1.3,  each 
time  the  matcher  drops  a  theory,  it  automatically  checks  whether 
this  now  means  that  there  is  some  part  of  the  answer  that  is
common  to  all  theories.  This  may  happen  one  frame  at  a  time,  or
sometimes  deletion  of  one  theory  may  eliminate  the  last  conflict
for  several  phonemes'  worth.  In  any  case,  for  each  frame  of  the
input,  the  program  types  out  (to  a  terminal  or  to  a  disk  file)  a
symbol  indicating  the  type  of  alignment:  self  loop,  advance  to  a
new  node,  share  the  input  frame  between  two  nodes,  and  whether  the
node  is  marked  as  a  phoneme  or  syllable  boundary.  The  program
can  also  output  the  various  components  of  the  score  for  each
input  frame,  as  well  as  the  cumulative  scores.  Also,  whenever
the  matcher  detects  that  the  best  path  crosses  a  phoneme  node  in
the  network,  it  types  out  the  name  of  the  phoneme  node.

An  addition  was  made  to  these  typeout  routines,  such  that
they  would  remember  the  names  of  all  the  phoneme  nodes  crossed, 
the  time  of  the  phoneme  boundaries,  and  whether  the  boundary  was 
also  a  syllable  boundary.  The  program  can  then  write  a  standard 
transcription  file  which  can  be  read  in  by  the  phonetic 
synthesizer  (as  well  as  many  other  utility  programs).  The 
synthesizer  then  reads  in  the  pitch  file  for  the  input  utterance, 
and  models  the  pitch  by  a  weighted  piecewise  linear  fit  to 
determine  the  pitch  values.  In  a  real  phonetic  vocoder,  this 
last  function  would  reside  in  the  recognizer. 

Below  we  describe  some  of  the  benchmark  experiments
performed.

3.4.2  Benchmark  Experiments 

The  first,  very  short  experiment  performed  was  to  try  to
phonetically  vocode  a  sentence.  We  used  the  untrained  network 
constructed  from  only  the  synthesis  diphone  templates.  66%  of 
the  phonemes  were  correctly  recognized.  However,  there  were  many 
extra  phonemes  in  the  output  string.  Upon  examination  of  the 
sequence  of  phonemes,  we  felt  that  it  would  be  unintelligible 
when  synthesized.  However,  when  we  used  this  output  in  the 
synthesizer,  we  found  the  sentence  somewhat  recognizable,  though 
there  were  problems.  When  we  played  the  sentence  to  several  naive
listeners,  they  understood  most  or  all  of  the  sentence.  This  led 
us  to  believe  that  we  were  correct  in  assuming  that  even  when  the 
phonemes  were  wrong,  the  spectrum  would  be  close  enough. 

At  this  point,  we  wanted  to  measure  the  effect  that  training 
would  have.  However,  we  did  not  want  to  wait  until  the  network 
was  fully  trained  before  testing  it.  Therefore,  we  recorded  the 
first  paragraph  of  the  Rainbow  Passage  two  times  (about  30  sec  or 
320 phones of speech each). The first reading of the paragraph
was manually transcribed and used to augment the network
(described  in  Section  3.3.1).  We  then  tried  to  recognize  the 
second  reading  of  the  paragraph.  From  careful  examination  of  the 
second reading, we felt that 5% of the phonemes were actually
different,  i.e.,  they  were  pronounced  differently.  This  meant 
that  about  90%  of  the  diphones  in  the  second  reading  of  the 
paragraph  had  at  least  one  template  in  the  network  that  came  from 
natural  speech  -  as  opposed  to  the  2800  diphone  templates 
extracted  from  nonsense  utterances  that  made  up  the  rest  of  the 
network.

The  result  of  this  small  test  was  that  79%  of  the  phonemes 
in  the  second  reading  were  recognized  correctly.  There  were 
still some extra phonemes in the output. When the synthesized
speech resulting from this output was played, most listeners
understood most of the sentences. However, we felt that both the
quality  and  intelligibility  would  need  to  be  improved  before  this 
system  would  be  acceptable. 

The  above  result  is  encouraging.  We  must  remember,  however, 
that  the  experiment  was  optimistic  in  that  the  diphone  templates 
were  extracted  from  the  same  global  environment  as  they  would  be 
used. We might hope that, in the final system, having several
training samples for every diphone will make up for not always
having diphones trained in the same context. The mode of
training  used  in  this  experiment  did  not  permit  the  use  of 
statistics.  Also,  we  would  hope  that  when  the  other  diphones  in 
the network have been trained, they would be less likely to be
confused  with  the  correct  diphones. 

In  a  third,  short  experiment,  we  carefully  constructed 
sentences  that  contained  a  mixture  of  "trained"  and  "untrained" 
diphones.  One  half  of  the  diphones  had  not  been  trained  on 
natural  speech.  Those  diphones  that  had  been  trained  were  used 
in  a  completely  different  environment  from  that  of  the  training 
sentences.  When  tested  on  the  matcher,  78%  of  the  trained 
diphones  were  recognized  correctly,  while  only  40%  of  the 
untrained  diphones  were  recognized.  Again,  this  test  was  not 
extensive  enough  to  predict  the  final  performance  of  the  system 
after  extensive  training. 

The  above  experiments  tell  us  that  training  of  the  network 
is very important. They also tell us that there is a significant
difference between using natural speech and using nonsense speech.
Therefore,  we  may  decide  to  delete  each  nonsense  diphone  template 
from  the  network  as  soon  as  there  is  a  corresponding  template 
from  natural  speech  to  replace  it.  This  capability  has  not  yet 
been  added  to  the  network  compiler  or  matcher,  though  it  would 
probably be relatively straightforward.

3.5 Conclusion

We  have,  during  the  past  year,  designed  and  implemented  an 
initial version of a phonetic recognition program particularly
suited to very-low-rate phonetic vocoding. The program uses a
network  of  diphone  templates  for  recognition,  which  makes  it  most 
compatible  for  use  with  the  diphone  template  synthesizer.  The 
program  was  also  designed  to  be  flexible  and  allow  interactive 
training,  so  as  to  be  useful  as  a  research  tool. 

Our preliminary results, based on a small amount of training
to the speech of a single speaker, are encouraging. While the
system will clearly need more training and some logical
modifications, we feel that the final outcome will be good.

4. MULTIRATE CODING

4.1  Introduction 

The  goal  of  the  speech  compression  project  this  year  has 
been  to  design  and  test  an  embedded-code  multirate  adaptive 
transform  coder.  The  objective  is  to  have  a  system  capable  of 
operating  at  an  arbitrary  data  rate  in  the  range  2.5  to  9.6  kb/s. 
It  was  also  desired  to  let  the  transmission  channel  vary  the  bit 
rate  while  the  algorithm  at  the  transmitter  remained  unaffected. 
This  last  constraint  is  satisfied  by  embedding  the  codes,  i.e., 
arranging  the  codes  in  such  a  manner  that,  when  some  bits  are 
discarded  by  the  channel,  the  remainder  of  the  received  bits  can 
still  be  decoded  in  a  meaningful  way  to  produce  a  speech  output. 

To  meet  the  requirements  of  the  project,  we  chose  a 
frequency-domain  approach  or  adaptive  transform  coding  (ATC), 
combined  with  our  newly  developed  methods  of  high  frequency 
regeneration  (HFR)  by  spectral  duplication.  The  combined  system 
is  basically  a  linear  prediction  (LPC)  vocoder  operating  at  2.5 
kb/s  and  embedded  in  a  voice-excited  coder  which  provides  higher 
speech  quality  at  higher  data  rates.  The  voice-excited  coder, 
realized  in  the  frequency  domain  by  ATC  techniques,  always 
operates at a fixed bit rate, say 16 kb/s. The multirate feature
of the system is realized by allowing the channel to strip off
some  of  the  bits  from  the  high  rate  system  thereby  achieving 
lower  data  rates.  Such  a  hybrid  LPC-ATC  system  offered  great 
flexibility  in  meeting  the  goals  of  the  project. 

In  the  remainder  of  this  section  we  shall  describe  the 
system  briefly  and  highlight  the  various  aspects  of  our  work. 

4.2  System  Description 

The  complete  multirate  coding  system  is  shown  in  Fig.  7.  At 
the  transmitter,  the  input  speech  signal  is  analyzed  as  in  the 
usual  LPC  vocoder,  namely,  LPC  parameters,  pitch,  and  gain  are 
derived,  quantized,  and  coded.  In  addition,  the  speech  signal  is 
inverse  filtered,  and  the  discrete  cosine  transform  (DCT)  of  the 
residual  is  taken.  With  pitch  known,  the  coefficient  of  a 
one-tap  pitch  filter  is  derived  from  the  residual.  This  pitch 
filter, together with the LPC parameters, forms the spectral model
to  be  used  in  the  bit  allocation  process  to  perform  the  coding  of 
the  DCT.  The  coded  DCT  components  are  then  delivered  to  the 
channel  which  may  transmit  all  of  the  bits  or  suppress  some  of 
them.

At  the  receiver,  the  DCT  codes  are  decoded  to  form  a 
baseband  DCT.  The  fullband  DCT  of  the  residual  is  then 
reconstructed using the method of HFR by spectral duplication.
An  inverse  DCT  yields  the  time-domain  reconstructed  residual 
waveform  to  be  used  as  input  to  the  all-pole  LPC  synthesis  filter 
to  produce  the  speech  output.  In  case  all  of  the  DCT  components 
are  suppressed  by  the  channel  (lowest  data  rate),  the  receiver 
becomes identical to that of a pitch-excited LPC vocoder and uses
a  pulse/noise  source  as  excitation  to  the  synthesis  filter. 
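
The transmitter-side analysis described above (LPC analysis, inverse filtering, DCT of the residual) can be sketched in a few lines. This is a minimal illustration only: the function names, the autocorrelation method with the Levinson recursion, and the unnormalized DCT-II are assumptions, since this section does not specify those details. Quantization, the pitch filter, and bit allocation are omitted.

```python
import math

def autocorr(x, order):
    """Autocorrelation lags r[0..order] of one analysis frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(order + 1)]

def levinson(r, order):
    """Levinson recursion. Returns a[0..order] with a[0] = 1, so that the
    inverse filter is A(z) = sum_k a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a

def inverse_filter(x, a):
    """Residual e[n] = sum_k a[k] x[n-k] (samples before the frame are 0)."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def dct2(x):
    """Unnormalized DCT-II of a frame (direct O(N^2) form, for clarity)."""
    n_pts = len(x)
    return [sum(x[n] * math.cos(math.pi * (n + 0.5) * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def analyze_frame(frame, order=10):
    """LPC parameters and the DCT of the LPC residual for one frame."""
    a = levinson(autocorr(frame, order), order)
    residual = inverse_filter(frame, a)
    return a, dct2(residual)
```

In this sketch the residual DCT returned by `analyze_frame` is what the ATC stage would go on to quantize; the LPC coefficients would be coded separately, as in an ordinary LPC vocoder.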

4.3  Summary  of  Work  Done 
4.3.1  Modified  ATC 

Traditionally,  in  conventional  ATC,  one  codes  the  DCT  of  the 
speech  signal  itself.  Perhaps  the  most  salient  feature  in  our 
implementation  of  ATC  is  that  we  code  the  DCT  of  the  linear 
prediction  residual.  In  the  proposal  [5]  for  this  work  and  in 
the  sixth  quarterly  progress  report  (QPR  #6)  [7]  on  this 
contract,  we  explained  how  one  can  quantize  the  DCT  of  the 
residual rather than the DCT of speech. We explained that not
only do both approaches yield the same signal-to-quantization-noise
ratio, but some advantages also accrue when coding the DCT
of the residual. One such advantage is that, when transmitting
the  DCT  of  the  residual,  the  all-pole  synthesis  required  at  the 
receiver  helps  smooth  frame-boundary  discontinuities  that  would 
normally  be  present  because  of  frequency-domain  quantization. 
Another advantage is that the DCT of the residual has a flat
spectral  envelope.  This  property  is  particularly  desirable  for 
regenerating  the  missing  high-frequency  components  by  spectral 
duplication  of  the  baseband.  The  HFR  process  creates  a  fullband 
spectrum  that  contains  the  pitch  information  and  has  a  flat 
spectral  envelope.  Thus,  the  reconstructed  spectrum  is 
particularly  well  suited  for  all-pole  synthesis.  Our  approach  is 
to  be  contrasted  with  the  case  where  the  DCT  of  speech  is 
transmitted.  Such  a  case  is  the  frequency-domain  counterpart  of 
so-called  voice-excited  coders  that  transmit  a  baseband  of  the 
speech  signal.  Time-domain  implementations  of  voice-excited  and 
residual-excited  coders  have  shown  that  the  quality  of  the  output 
speech  obtained  with  residual-excited  coders  is  always  preferred 
to  that  obtained  with  voice-excited  coders. 

4.3.2  Bit  Allocation 

The  spectral  model  of  speech  used  in  bit  allocation  was 
discussed  in  QPR  #5  [3].  Briefly,  it  consists  of  two  components: 
a  smooth  (LPC)  spectral  envelope,  and  a  model  for  the  harmonic 
structure  of  the  spectrum  (the  pitch  filter).  In  QPR  #6  [7]  we 
showed  how  bit  allocation  can  be  interpreted  as  uniform 
quantization  of  the  logarithm  of  the  weighted  spectral  model. 
The  weighting  applied  to  the  spectral  model  is  the  reciprocal  of 
the desired spectral envelope for the output noise. The process
of  uniform  quantization  of  the  log-spectral-model  is  shown  in 
Fig.  8.  The  dashed  curves  in  the  figure  define  the  quantization 
intervals  and  they  take  on  the  shape  of  the  desired  spectral 
envelope  of  the  output  noise.  In  the  absence  of  noise  shaping, 
these  curves  would  be  straight  horizontal  lines.  In  our 
implementation  of  bit  allocation,  the  spacing  between  the  curves 
is  5  dB  instead  of  the  traditionally  used  6  dB.  In  fact, 
ideally,  the  spacing  between  the  curves  should  not  be  the  same 
everywhere,  i.e.,  the  quantization  should  be  non-uniform.  This 
subject was discussed in detail in QPR #6 [7].
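
The interpretation of bit allocation as uniform quantization of the weighted log-spectral-model can be sketched as follows. This is an illustrative reading, not the report's implementation: the function names, the cap on code length, and the bisection used to meet a bit budget are assumptions. Only the 5 dB spacing and the weighting by the reciprocal of the desired noise envelope come from the text.

```python
import math

def allocate_bits(log_model_db, noise_shape_db, threshold_db,
                  step_db=5.0, max_bits=5):
    """Code length per DCT coefficient.

    The weighted log-model (model minus desired noise envelope, in dB) is
    quantized uniformly with levels step_db apart. A coefficient whose
    weighted value lies below threshold_db gets 0 bits; each further
    step_db above the threshold adds one bit, up to max_bits.
    """
    bits = []
    for model, shape in zip(log_model_db, noise_shape_db):
        weighted = model - shape
        levels = math.floor((weighted - threshold_db) / step_db) + 1
        bits.append(min(max_bits, max(0, levels)))
    return bits

def fit_budget(log_model_db, noise_shape_db, budget,
               step_db=5.0, max_bits=5):
    """Bisect on the threshold so that the total allocation fits the budget."""
    weighted = [m - s for m, s in zip(log_model_db, noise_shape_db)]
    lo = min(weighted) - step_db * max_bits   # every coefficient gets max_bits
    hi = max(weighted) + step_db              # no coefficient gets any bits
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        total = sum(allocate_bits(log_model_db, noise_shape_db, mid,
                                  step_db, max_bits))
        if total > budget:
            lo = mid                          # too many bits: raise threshold
        else:
            hi = mid                          # fits: try a lower threshold
    return allocate_bits(log_model_db, noise_shape_db, hi, step_db, max_bits)
```

With a flat `noise_shape_db` the quantization levels are horizontal lines, which corresponds to the case without noise shaping mentioned above. A non-flat shape moves bits toward the frequencies where less noise is wanted.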

FIG. 7. Bit Allocation in ATC by Uniform Quantization of the
Log-Spectral-Model

4.3.3 Embedded Multirate Coding

For each input frame, the transmitter transmits a block of
bits which is divided into two major parts. The first part
contains the bits representing the system parameters and the
second part contains the bits representing the DCT components.
One such block of bits is shown in Fig. 9. The first part of the
parameter codes is essentially identical to what is transmitted
by an LPC vocoder. The additional parameter codes needed are: one
pitch tap for the harmonic model of the DCT spectrum, and HFR
codes to be discussed in Section 4.3.4. The maximum data rate of
16,000 b/s is achieved when the DCT of the fullband residual is
transmitted. The minimum data rate achievable by the system
takes place when all the codes representing the DCT components
are suppressed by the channel. Intermediate data rates are
achieved by stripping off bits from the high data rate system.
Stripping off bits results in the suppression of low-variance
frequency components and/or high-frequency components, depending
on how the stripping is performed. This aspect of the system is
explained below.

It  is  worth  mentioning  here  that  the  transmitter  itself  can 
strip  off  bits  prior  to  transmission  in  the  same  manner  as  is 
done  by  the  channel.  Thus,  when  a  system  is  first  turned  on,  the 
initial  bit  rate  need  not  be  high  and  traffic  congestion  can  be 
avoided.  Note  that  the  receiver  does  not  need  to  know  where  in 
the  system  the  bits  are  discarded. 

The  codes  representing  the  DCT  components  are  arranged  in  a 
certain  order  prior  to  transmission.  This  ordering  determines 
which  bits  get  discarded  first,  which,  in  turn,  determines 
whether  high  frequency  components  or  low  variance  components  get 
discarded.  To  study  the  tradeoff  between  the  number  of 
transmitted  bits,  the  quantization  accuracy,  and  the  number  of 
transmitted  frequency  components,  we  investigated  three 
bit-ordering  techniques.  These  were  explained  in  detail  in  QPR 
#7  [6]  and  we  summarize  them  below. 

The first bit-ordering technique, which we call the baseband
coder approach, is the simplest: the codes are arranged by order
of increasing frequency. When the channel strips off bits from
the end of each block, the high frequency components are
discarded first. The remaining codes represent a low frequency
portion of the total bandwidth referred to as a baseband. As in
a baseband coder, the receiver regenerates the missing
high-frequency components. We use the method of high-frequency
regeneration (HFR) by spectral duplication, which is explained in
Section 4.3.4.

FIG. 8. One Block of Bits Per Frame Illustrating the
Embedded-Code Multirate Feature of the System

In the second bit-ordering technique we discard the least
significant  bit  of  each  DCT  code,  starting  with  the  high 
frequency  components,  until  the  desired  rate  is  reached.  If  need 
be,  the  next-to-least  significant  bits  of  each  code  are  also 
discarded.  This  method  decreases  the  quantization  accuracy 
uniformly  over  the  whole  band.  It  is  equivalent  to  having 
performed  the  initial  bit  allocation  with  fewer  bits,  e.g.,  120 
bits  instead  of  250  bits  for  a  rate  of  9  kb/s  instead  of  16  kb/s. 

In  the  third  ordering  technique,  the  discarded  bits 
represent  DCT  components  with  a  relatively  low  variance.  Since 
the  code length  obtained  by  bit  allocation  is  directly 
proportional  to  the  relative  variance  of  the  DCT  components,  this 
technique  is  realized  by  discarding  all  1-bit  codes,  then  all 
2-bit codes, etc., until the desired rate is achieved.
Referring  back  to  Fig.  8,  we  can  see  that  this  technique 
corresponds  to  having  merged  together  into  one  large  level,  two 
or  more  inner  levels  of  the  uniform  quantization  of  the 
log-spectral-model.  For  example,  at  9.6  kb/s,  the  method 
corresponds  roughly  to  having  merged  the  0-bit  and  1-bit  steps 
into  one  large  0-bit  step. 

When  comparison  is  made  at  the  same  bit  rate,  the  second  and 
third  ordering  techniques  yield  a  baseband  that  is  generally 
narrower than that of the baseband coder approach. In exchange
for  a  smaller  baseband  width,  the  second  and  third  techniques 
provide  some  additional  DCT  components  scattered  in  the 
high-frequency  region.  Further  details  on  embedded  coding  are 
found  in  QPR  #7  [6],  where  we  concluded  that  the  first 
bit-ordering  technique,  i.e.,  the  baseband  coder  approach,  yields 
the  best  output  speech  quality  for  data  rates  in  the  range  6.4  to 
9.6  kb/s. 
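
The difference between the first and third bit-ordering techniques can be made concrete with a small sketch. The function names and the toy code lengths are hypothetical, and the third technique is simplified here to dropping whole classes of codes (all 1-bit codes, then all 2-bit codes, and so on) rather than modelling the exact ordering of bits inside the transmitted block.

```python
def received_baseband(code_lengths, bits_kept):
    """Baseband coder approach: codes are ordered by increasing frequency
    and the channel keeps only the first bits_kept bits of the block.

    Returns (width, used): the number of leading DCT components spanned by
    the codes that survive whole, and the number of bits they occupy.
    """
    used = 0
    width = 0
    for k, length in enumerate(code_lengths):
        if used + length > bits_kept:
            break                      # this code is cut off by the channel
        used += length
        if length > 0:
            width = k + 1              # last component that carries bits
    return width, used

def strip_low_variance(code_lengths, bits_kept):
    """Third technique, simplified: drop all 1-bit codes, then all 2-bit
    codes, and so on, until the block fits in bits_kept.

    Returns the indices of the DCT components that are still transmitted.
    """
    kept = list(code_lengths)
    drop = 1
    while sum(kept) > bits_kept:
        kept = [0 if length == drop else length for length in kept]
        drop += 1
    return [k for k, length in enumerate(kept) if length > 0]
```

For code lengths `[3, 2, 0, 2, 1, 1]` and 7 bits kept, the first function keeps a contiguous band of components, while the second keeps a shorter contiguous run plus whichever higher-frequency components happen to have long codes. This is the tradeoff between baseband width and scattered high-frequency components described above.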

4.3.4  High  Frequency  Regeneration 

The  aim  of  high  frequency  regeneration  (HFR)  is  to  recreate 
the  missing  high  frequency  components  of  a  signal. 
Traditionally,  some  form  of  non-linear  distortion  of  the 
time-domain signal is used, e.g., waveform rectification. More
recently, we introduced HFR methods where the missing components
are regenerated by duplicating the baseband components at high
frequencies [8]. We call this approach the spectral duplication
method of HFR. Spectral duplication aims at reconstructing a
fullband spectrum that has a harmonic structure and a flat
spectral envelope. Care must be taken not to interrupt the
harmonic structure of the signal spectrum. Our initial efforts
were described in a previous annual report [1] while our more
recent work is detailed in QPR #7 [6]. We now summarize the
final  algorithm  that  we  are  currently  using. 


Report  No.  4414 


Bolt  Beranek  and  Newman  Inc. 


The  HFR  method  is  illustrated  in  Fig.  10.  The  analysis 
process  needed  at  the  transmitter  is  as  follows.  First,  a 
nominal baseband of fixed width (1 kHz) is translated up in
frequency  to  fill  the  region  from  1  to  2  kHz.  Second,  to  find  an 
optimal  position  for  the  baseband,  the  baseband  is 
cross-correlated  with  the  actual  fullband  DCT  present  in  that 
region.  The  lag  at  which  the  correlation  is  maximum  is 
interpreted  to  be  the  point  where  the  baseband  best  duplicates  or 
best  matches  the  original  DCT  components.  It  is  the  harmonic 
structure  of  the  DCT  that  helps  lock  the  baseband  into  place. 
Thus,  the  method  preserves  the  harmonic  continuity  of  the  DCT. 
The  lag  value  is  transmitted  by  means  of  a  3-bit  HFR  code,  which 
accommodates  lags  between  -3  and  +4  spectral  points.  The  same 
process  is  repeated  for  higher  frequency  bands. 

At  the  receiver,  the  received  baseband  is  translated  to  its 
nominal  position  (1  to  2  kHz)  and  additionally  shifted  by  a  few 
spectral  points  as  indicated  by  the  HFR  code.  Since  the  received 
baseband  is  seldom  equal  to  1  kHz  in  width,  the  regenerated  bands 
may  either  overlap  one  another  or  have  gaps  between  them.  These 
cases are easily taken care of (QPR #7 [6]). We now report on
the  latest  results  obtained  with  the  most  recent  versions  of  the 
various  aspects  of  the  multirate  system. 
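
The lag search at the transmitter and the corresponding shift at the receiver can be sketched as follows for a single regenerated band. The function names are hypothetical, and mapping lags -3..+4 onto code values 0..7 is an assumed convention; the report states only that a 3-bit code covers that lag range. Repeating the process for higher bands and handling overlaps or gaps are left out.

```python
def best_lag(baseband, fullband, start, lags=range(-3, 5)):
    """Transmitter: find the shift, relative to the nominal position
    `start`, at which the translated baseband correlates best with the
    actual fullband DCT."""
    n = len(baseband)
    best, best_score = None, float("-inf")
    for lag in lags:
        lo = start + lag
        if lo < 0 or lo + n > len(fullband):
            continue                   # shifted band would leave the spectrum
        score = sum(b * f for b, f in zip(baseband, fullband[lo:lo + n]))
        if score > best_score:
            best, best_score = lag, score
    return best

def encode_lag(lag):
    """Map a lag in -3..+4 to a 3-bit code 0..7 (assumed convention)."""
    return lag + 3

def decode_lag(code):
    """Inverse of encode_lag."""
    return code - 3

def regenerate_band(baseband, code, start, total):
    """Receiver: copy the baseband to its nominal position plus the decoded
    lag, leaving the received baseband itself untouched."""
    n = len(baseband)
    lag = decode_lag(code)
    out = list(baseband) + [0.0] * (total - n)
    for i, value in enumerate(baseband):
        k = start + lag + i
        if n <= k < total:
            out[k] = value
    return out
```

When the spectrum has peaks every few spectral points, the correlation is largest at the lag that lines the copied peaks up with the true ones. This is how the harmonic structure locks the baseband into place.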


FIG. 9. High Frequency Regeneration by Spectral Duplication
(Panel A, at the transmitter: 1. Choose point of maximum correlation; 2. Transmit 3-bit HFR codes. Panel B, at the receiver: 1. Translate received baseband; 2. Fill gaps. Horizontal axis: frequency.)

4.3.5 Results

With  the  HFR  method  as  outlined  above,  the  results  for  the 
baseband  coder  approach  of  embedded  multirate  coding  are  shown  in 
Table  1. 


TOTAL RATE    AVERAGE RECEIVED       AVERAGE CODELENGTH
(kb/s)        BASEBAND WIDTH (Hz)    (bits/sample)

16.0          3333                   1.95
 9.6          1400                   2.3
 7.2           870                   2.4
 6.4           670                   2.5

TABLE 1. Results Obtained with the Baseband Coder Approach for
Multirate Coding

From  the  table  it  can  be  seen  that  at  6.4  kb/s  the  average  width 
of  the  received  baseband  is  about  670  Hz.  This  is  quite  a  narrow 
baseband  and  constitutes  the  major  obstacle  for  lowering  the  data 
rate  any  further.  At  6.4  kb/s,  the  quality  of  the  synthetic 
speech  is  acceptable,  but  lowering  the  bit  rate  further  causes 
the  speech  to  be  quite  rough,  hollow-sounding,  and  with 
occasional pops. Thus, for rates below 6.4 kb/s, we recommend
using the 2.5 kb/s LPC vocoder. For rates larger than 6.4 kb/s,
the  average  baseband  width  is  substantially  increased  and  the 
quality  of  the  speech  improves  markedly.  The  quality  at  16  kb/s 
is  very  close  to  the  original. 
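
As a rough consistency check on Table 1, the bit rate spent on the DCT components can be estimated from the baseband width and the average code length. The sketch below assumes that a band of width B Hz is represented by 2B real DCT samples per second (the Nyquist rate); that assumption and the names used are not taken from the report. The remainder after subtracting the DCT rate from the total is an inferred figure for the parameter codes, not a reported one.

```python
# Rows of Table 1: (total rate in kb/s, baseband width in Hz, bits/sample)
TABLE_1 = [
    (16.0, 3333, 1.95),
    (9.6, 1400, 2.3),
    (7.2, 870, 2.4),
    (6.4, 670, 2.5),
]

def dct_rate_kbps(width_hz, bits_per_sample):
    """Bit rate of the DCT codes, assuming 2 * width_hz samples per second."""
    return 2.0 * width_hz * bits_per_sample / 1000.0

def parameter_rate_kbps(total_kbps, width_hz, bits_per_sample):
    """What is left of the total rate for the system parameter codes."""
    return total_kbps - dct_rate_kbps(width_hz, bits_per_sample)

# Inferred parameter rate for each row of the table.
side_rates = [parameter_rate_kbps(*row) for row in TABLE_1]

if __name__ == "__main__":
    for row, side in zip(TABLE_1, side_rates):
        print(row[0], round(dct_rate_kbps(row[1], row[2]), 2), round(side, 2))
```

Under this assumption the remainder comes out at roughly 3 kb/s in every row, which is in line with a 2.5 kb/s LPC parameter stream plus the pitch tap and HFR codes.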

4.4  Conclusions 

In  this  project,  we  have  capitalized  on  the  advantages  of  a 
frequency-domain  approach  to  realize  a  versatile  embedded-code 
multirate  speech  coding  system.  Some  of  the  advantages  of  our 
system  are:  the  ease  with  which  spectral  noise-shaping  can  be 
implemented,  the  ease  with  which  the  tradeoff  between 
quantization  accuracy  and  baseband  width  can  be  studied,  the  ease 
with  which  the  codes  can  be  embedded  for  multirate  operation,  and 
the  ease  with  which  high  frequency  regeneration  can  be  done  for 
effective  voice  excitation.  In  addition,  the  system  is  suitable 
for  real-time  implementation  on  existing  array  processors. 

APPENDIX A

CATEGORIZATION OF DIPHONES

C3#C4 DIPHONES

APPENDIX B

EXHAUSTIVE LIST OF 2733 DIPHONES

coding (ATC). At the receiver, high-frequency regeneration is accomplished by our
recently developed method of spectral duplication. Care is taken to ensure that the
harmonic structure is not interrupted as a result of the spectral duplication process.

The results are compared to time-domain coding of the baseband residual.

Introduction 

Baseband coders, also known as voice-excited coders, were originally proposed as a
compromise between pitch-excited coders (such as LPC, channel, and homomorphic
vocoders) and waveform coders. Today, baseband coders offer attractive alternatives at data rates
in the range 6.4-9.6 kb/s. This range of data rates has become increasingly important because
modems are now available that operate reliably in that range over regular telephone lines.

Below, we shall assume that, at the receiver, the synthesizer obeys the general synthesis
model shown in Fig. 1. In the figure, the synthetic or reconstructed speech signal r(t) is
generated as the result of applying a time-varying excitation signal, u(t), as input to a
time-varying spectral shaping filter. The spectral envelope of the excitation is assumed to
be flat, so that the spectral envelope of the synthetic speech is determined completely by
the spectral shaping filter. The parameters of the model, i.e., the excitation and the filter,
must be computed and transmitted periodically by the transmitter. Those parameters that
represent the speech spectrum, denoted as spectral parameters, are computed, quantized and
transmitted every 15-30 ms. In a baseband coder, a low-frequency portion of the excitation,
known as a baseband, is transmitted and used at the receiver to regenerate the
high-frequency portion of the excitation. The sum of the transmitted baseband and the
regenerated high-frequency band constitutes the excitation u(t) to the synthesizer.

Fig. 1. Basic synthesis model for baseband coder.

In a baseband coder, synthetic speech quality is determined by four factors: a) width of the
baseband, b) coding of the baseband, c) estimation and coding of spectral parameters, and d) the
high-frequency regeneration (HFR) method employed. In this paper we shall concentrate mainly on
the second and fourth factors.

Adaptive-Transform Baseband Coder

Figure 2 shows the transmitter portion of a digital baseband coder, based on a linear
prediction (LPC) representation. The speech signal s(t), sampled at 2W Hz, is filtered with the
LPC inverse filter A(z) to produce the residual waveform e(t). The subsequent coding of e(t) may
be quite simple, using adaptive quantization (APCM), or may be more complicated, employing
adaptive predictive coding (APC), or sub-band coding (SBC) techniques. In this paper, we propose
the use of Adaptive Transform Coding (ATC) techniques [9,10,11] to code the baseband residual.

Fig. 2. Transmitter for adaptive-transform baseband coder.

As shown in Fig. 2, the discrete cosine transform (DCT) of the residual e(t) is obtained and
coded using ATC techniques. The necessary lowpass (or bandpass) operation needed to retain the
low frequency portion of the residual is not shown explicitly in the figure, since it can be done
directly in the cosine transform domain. Only the baseband portion of the transform is coded and
transmitted. The method of coding is explained in the next section. At the receiver (see Fig.
3), the tasks are to regenerate the high-frequency portion of the DCT of the excitation signal,
perform an inverse cosine transformation, and synthesize the output speech.

Fig. 3. Receiver for adaptive-transform baseband coder.
(Block diagram labels: from channel; decoding of baseband; high-frequency regeneration; inverse cosine transform; parameters.)

Transform Coding of the Baseband

Adaptive transform coding has been recently shown to be an efficient method for digital
transmission of the speech signal [9,10,11]. However, the available ATC methods deal with the
transform of the speech signal itself rather than that of the residual. There are several reasons
to prefer using ATC to code the residual rather than the speech signal. The important reason is
that the synthesis filter 1/A(z) (see Fig. 3) smooths the possible frame-boundary discontinuities
caused by quantization. Thus no overlap between frames is necessary. We have verified the above
statement experimentally for the case of the baseband coder. In the future, we hope to reach a
similar conclusion for the full-band case. Other advantages in using ATC to code the baseband
residual are mentioned later in this section.

ATC of speech. Here, we briefly review ATC of speech and give it a new interpretation, which
we use to introduce the changes that are necessary for using ATC to code the residual.

In quantizing schemes, an often used performance criterion is to maximize the
signal-to-quantization-noise (S/Q) ratio, for a fixed number of bits. It can be shown that the
S/Q ratio is maximized by quantizing each of the transform (DCT) coefficients using the same
quantization step size, and where the available bits are properly allocated among the
coefficients. The allocation of bits (or quantizer levels) depends on the model of the DCT
spectrum that is employed. It can be shown that, when the number of levels used is set to be
proportional to the magnitude of the spectral model, the gain in S/Q afforded by ATC over APCM of
speech is equal to the ratio of the arithmetic mean to the geometric mean of the model. For
speech, a good model to use consists of a smooth spectral (LPC) envelope and a harmonic structure
to model pitch [10].
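
The stated gain, the ratio of the arithmetic mean to the geometric mean of the spectral model, is easy to evaluate numerically. The sketch below takes the model values to be powers (variances) at each DCT frequency, which is an assumption about how "the model" enters the formula; the function name is hypothetical.

```python
import math

def atc_gain_db(model_power):
    """Gain in S/Q, in dB, as the ratio of the arithmetic mean to the
    geometric mean of the (strictly positive) spectral model values."""
    n = len(model_power)
    arithmetic = sum(model_power) / n
    geometric = math.exp(sum(math.log(p) for p in model_power) / n)
    return 10.0 * math.log10(arithmetic / geometric)
```

A flat model gives 0 dB, so nothing is gained over coding every coefficient alike. The more peaked the model, through formant peaks in the envelope or a pronounced pitch structure, the larger the ratio and the larger the predicted gain.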

An alternate way of viewing the quantization process in ATC is to assume that we first
"normalize" the DCT coefficients by dividing them by the corresponding magnitude of the spectral
model. Then, to achieve the same S/Q as before, we must use the same levels (number of bits) as
in the non-normalized case, but we must vary the quantization step-size for each coefficient to be
inversely proportional to the spectral model amplitudes at that frequency. This alternate view of
ATC paves the way for using ATC in coding the residual.

ATC of the residual. The DCT of the residual can be considered to be equivalent to the
"normalized" DCT of speech. The reason is that inverse filtering in the time domain is
approximately equivalent in the transform domain to dividing the DCT coefficients of the speech by
the magnitude of the LPC spectral envelope. Therefore, by using the same number of bits allocated
at each frequency as in the usual ATC case, and varying the step-size for each DCT coefficient in
proportion to the magnitude spectrum of the LPC inverse filter, one maintains the same S/Q ratio
as in ATC of the speech.

In recent experiments [12] with the baseband coder, we used APCM to code the baseband
residual. In these experiments, we were able to determine that some of the roughness in the
reconstructed speech signal was due to quantization noise. We feel that in our proposed baseband
coder, ATC of the baseband will provide a sufficient increase in S/Q over APCM to help eliminate
the roughness caused by quantization noise. We shall also incorporate spectral noise shaping into
the design of the system, as has been done on the full-band speech.

We note here that the proposed scheme is a frequency-domain coder, and thus, it lends itself
nicely to the pitch-adaptive HFR method to be discussed in the next section. A further advantage
of the proposed scheme is that it is applicable to the full-band residual, or any portion thereof.
Thus, it lends itself quite easily to situations where a multirate system is desired. For the
full-band case, the transmission rate is 16 kb/s. For lower rates, only a baseband is
transmitted. The width of the baseband is determined by the channel bandwidth. Various bit-rates
are achieved by increasing or decreasing the number of transmitted DCT coefficients, i.e., wider
or narrower basebands. We plan to test our scheme in a multirate environment.

It is well-known that if the baseband has either the voice fundamental or at least two
adjacent harmonics, a waveform containing all the harmonics of voiced input speech can be
generated by feeding the baseband signal to an instantaneous, zero-memory, nonlinear device. A
traditional approach used in the past to provide the requisite nonlinearity has been some form of
waveform rectification. We have presented elsewhere [12] a new HFR method based on duplication of
the baseband spectrum. In particular, we presented time-domain systems that perform spectral
duplication in each of two ways: (a) spectral folding, and (b) spectral translation. Fig. 4
illustrates the effects of the two methods. In the figure, W denotes the total input bandwidth
and B denotes the width of the baseband, with W=LB, where L is an integer.

Our experience with the above mentioned integer-band spectral duplication methods is that
they do not seem to cause any perceptible roughness, as is the case with waveform
rectification methods. However, some low-level background tones are audible with the new HFR
methods. A possible reason for the existence of these background tones is the fact that the
harmonic structure is interrupted at multiples of the baseband width B Hz. Therefore, we
hypothesized that the tones could be eliminated by adjusting the width B of the baseband to be a
multiple of the pitch fundamental frequency on a short-term basis. Such a scheme would require an
enormous amount of computation if it were to be implemented in the time domain. Below, we shall
explain a frequency-domain pitch-adaptive spectral translation method.

[Figure: (a) baseband spectrum; (b) spectral folding; (c) spectral translation. In the example
shown, W = 3B.]

The idea here is based on the fact that, in the adaptive-transform baseband coder, the
baseband DCT components can be easily duplicated at higher frequencies, to obtain the full-band
excitation signal. In order not to interrupt the harmonic structure of the excitation signal, an
estimate of the pitch must be used. In case the spectral model at the transmitter makes use of
pitch, the value of pitch is transmitted and is readily available at the receiver. Otherwise, the
receiver can easily extract a pitch value from the received baseband, e.g., by detecting the
location of the peak of the autocorrelation of the baseband. With pitch known, the receiver
extracts a subinterval of the baseband containing an integer number of harmonics. The chosen
subinterval is duplicated (translated) at higher frequencies, as many times as necessary, to fill
the missing frequency components.


For voiced sounds, we found that good perceptual results are obtained when the subinterval
extends from the spectral valley just before the first harmonic to the valley just after the last
complete harmonic present in the baseband. The upper frequency edge of the subinterval is C Hz,
and is pitch-dependent. For unvoiced sounds, the subinterval consists of the whole baseband, less
its two end points: the d.c. component and the component at B Hz. Following the HFR process, an
inverse DCT yields the full-band time-domain excitation signal to be applied to the synthesis
filter 1/A(z). We note here that the effective baseband width is C<B. The received frequency
components between C and B are discarded, and those between C and W are regenerated.
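The subinterval selection and translation described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this work: the function name `regenerate_high_band`, the uniform bin spacing `bin_hz`, and the placement of the spectral valleys halfway between harmonics are all hypothetical simplifications.

```python
def regenerate_high_band(baseband, n_total, bin_hz, f0_hz=None):
    """Fill DCT bins above the baseband by translating a baseband subinterval.

    baseband : DCT coefficients for bins 0..len(baseband)-1 (bin k ~ k*bin_hz Hz)
    n_total  : number of DCT bins in the full band
    f0_hz    : pitch estimate for voiced frames, or None for unvoiced frames
    """
    nb = len(baseband)
    if f0_hz:
        # Assumption: valleys sit halfway between harmonics, so the subinterval
        # runs from 0.5*f0 to (m + 0.5)*f0, m = number of complete harmonics.
        harmonics = int(nb * bin_hz / f0_hz - 0.5)
        if harmonics < 1:
            raise ValueError("baseband holds no complete harmonic")
        lo = round(0.5 * f0_hz / bin_hz)
        hi = round((harmonics + 0.5) * f0_hz / bin_hz)  # upper edge C
    else:
        # Unvoiced: whole baseband less the d.c. bin and the bin at B.
        lo, hi = 1, nb - 1
    sub = baseband[lo:hi]
    full = list(baseband[:hi])          # bins between C and B are discarded
    for k in range(hi, n_total):        # bins from C upward are regenerated
        full.append(sub[(k - hi) % len(sub)])
    return full
```

With a pitch of 100 Hz and 25 Hz per bin, the harmonics of a 20-bin baseband fall on every fourth bin, and the regenerated bins continue that spacing without interruption.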

One possible extension of the above described HFR method is to let the transmitter locate and
extract the subinterval of the baseband. In such a case the transmitted baseband would be
pitch-adaptive and of width C Hz.

A second, and perceptually more important extension of the method, provides for a better
placement of the frequency-translated intervals. In the above described method of HFR, we assumed
that the spectrum of voiced sounds is periodic in frequency. However, we know that speech spectra
are not exactly harmonic in structure. To take into account the irregularities of the speech
spectrum, we shift the high-frequency interval around its nominal position, in such a manner as to
match, as best as possible, the corresponding original DCT components of the full-band residual.
This task is done at the transmitter where the DCT of the full-band is available. First, the
chosen subinterval is translated to its nominal high-frequency position. Then, it is
cross-correlated with the corresponding DCT components of the full-band residual. The "optimal"
location is then chosen to be at the positive maximum value of the cross-correlation. When short
correlation lags are considered, i.e., between J and 3, the additional cost is only I bits for
each frequency-translated interval. The transmitted information indicates to the receiver where
to place the baseband subinterval in the high frequency region, relative to its nominal position.
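The lag search performed at the transmitter can be sketched as below. This is a hedged illustration, not the authors' code: the function name `best_shift` is hypothetical, and the symmetric lag range `-max_lag..max_lag` is an assumption made only for the example.

```python
def best_shift(sub, full_band, nominal_start, max_lag):
    """Return the lag (relative to nominal_start) that maximizes the positive
    cross-correlation between the translated subinterval and the original
    full-band DCT components. Returns 0 if no lag gives a positive value."""
    best_lag, best_corr = 0, 0.0
    for lag in range(-max_lag, max_lag + 1):
        start = nominal_start + lag
        if start < 0 or start + len(sub) > len(full_band):
            continue  # this placement would fall outside the full band
        segment = full_band[start:start + len(sub)]
        corr = sum(s * f for s, f in zip(sub, segment))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

Only the chosen lag needs to be sent for each translated interval, which is why the side information stays small when the lag range is short.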

Preliminary Results

We have implemented a preliminary version of the proposed adaptive-transform baseband coder
and simulated its operation at a transmission rate of 9.6 kb/s. For inputs at a sampling rate of
6.67 kHz, we have chosen the frame size to be 19.2 ms, i.e., 128 samples. At each frame, a
128-point DCT of the residual is obtained, and only the first so coefficients (B = 1.! kHz) are
retained. In our experiments thus far, we transmitted the baseband residual using APCM.
Parameters of the spectral model used during analysis (LPC and gain parameters) are transmitted
separately at the rate of 2.3 kb/s, leaving about 7.3 kb/s for the baseband residual.
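The frame arithmetic quoted above can be checked with a few lines. This is a sketch for illustration only; the helper name `frame_budget` is hypothetical, and the total rate of 9.6 kb/s is taken as read from the text.

```python
def frame_budget(frame_samples, frame_ms, total_kbps, side_kbps):
    """Return (sampling rate in kHz, total bits, side-info bits, residual bits)
    per frame. Note that kb/s multiplied by ms gives bits."""
    fs_khz = frame_samples / frame_ms
    total_bits = total_kbps * frame_ms
    side_bits = side_kbps * frame_ms
    return fs_khz, total_bits, side_bits, total_bits - side_bits


fs_khz, total_bits, side_bits, residual_bits = frame_budget(128, 19.2, 9.6, 2.3)
```

A 128-sample frame lasting 19.2 ms implies a sampling rate of about 6.67 kHz, and the difference between the two rates leaves 7.3 kb/s, or roughly 140 bits per frame, for the baseband residual.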

In a first experiment, we implemented integer-band spectral translation in the DCT domain.
We found the frequency-domain results to be perceptually similar to the time-domain results, with
low-level tones and no roughness in the background. In a second experiment, we implemented the
proposed pitch-adaptive HFR method and found that it largely eliminates the low-level tones, but
it introduces a certain amount of roughness reminiscent of rectification. Finally, in a third
experiment, we performed the pitch-adaptive HFR method with the added cross-correlation feature
for better spectral duplication. Upon informal listening, we found that this system provides a
marked improvement in speech quality over the traditional waveform rectification approach and over
the non-pitch-adaptive time-domain spectral duplication methods. We expect the quality of the
synthetic speech to improve further when we change the coding of the baseband from APCM to ATC.

Conclusions 

In this paper we described an adaptive-transform baseband coder. The salient features of
this coder are: (a) transmitting the DCT of the baseband residual, and (b) regenerating the
missing high frequency components in a pitch-adaptive manner. Since the transmitted information
is in the frequency domain, the method lends itself very easily to the pitch-adaptive spectral
translation method of high-frequency regeneration. Also, the method we described for transform
coding the residual is applicable to the full-band residual, or any fraction thereof. Therefore,
this coder is an attractive possibility as a multi-rate system.
Abstract-The discrete cosine transform (DCT) of an N-point real
signal is derived by taking the discrete Fourier transform (DFT) of a
2N-point even extension of the signal. It is shown that the same result
may be obtained using only an N-point DFT of a reordered version of
the original signal, with a resulting saving of 1/2. If the fast Fourier
transform (FFT) is used to compute the DFT, the result is a fast
cosine transform (FCT) that can be computed using on the order of
N log2 N real multiplications. The method is then extended to two
dimensions, with a saving of 1/4 over the traditional method that uses the
DFT.


I.  Introduction 

THE discrete cosine transform (DCT) has had a number of
applications in image processing (see [1]) and, more recently,
in speech processing [2], [3]. Compared to other
orthogonal transforms, its performance seems to compare
most favorably with the optimal Karhunen-Loeve transform
of a large number of signal classes [2], [4]. It has been shown
that the DCT of an N-point signal can be computed using a
2N-point discrete Fourier transform (DFT) [4]. Chen et al.
[1] used matrix factorization to derive a special algorithm to
compute the DCT of a signal with N a power of 2, resulting in a
saving of 1/2 over the previous method when the latter uses
the fast Fourier transform (FFT).¹ More recently, Narasimha
and Peterson [7] developed a method that employs an N-point
DFT of a reordered version of the signal (where N is
assumed to be even), resulting in a similar saving of 1/2. When
N is a power of 2, use of the FFT results in a saving comparable
to that of Chen et al. [1]. However, in the algorithm of
Narasimha and Peterson, one can use existing software to compute
the FFT instead of implementing a special algorithm for
the DCT.

The algorithm presented here for the 1-D case is essentially
identical to that of Narasimha and Peterson [7].² Our algorithm
is more general in that N may be odd or even. This
generalization and the extension to the 2-D case are facilitated

Manuscript received November 27, 1978; revised April 25, 1979 and
August 28, 1979. This work was supported by the Advanced Research
Projects Agency and monitored by RADC/LTC under Contract F19628-78-C-0136.

The author is with Bolt Beranek and Newman, Inc., Cambridge, MA
02138.

¹Chen et al. [1] claim a larger saving. However, if in the conventional
method one takes advantage of the fact that the signal is real, then the
saving amounts to only 1/2.

²[7] was unknown to the author when this paper was first submitted
for publication. The author thanks R. Crochiere and the reviewers
for bringing [7] to his attention. The parts of this paper that
overlap [7] have been retained to enhance the tutorial aspect of this
paper.


by the view taken that the DCT can be regarded essentially as
the DFT of an even extension of the signal. The generalization
to the m-D case should then be straightforward, with a resulting
saving of 1/2^m over the traditional method that employs
the DFT. In [7] the authors mention that the inverse
DCT can be obtained using a number of computations equal
to that of the forward DCT. Here, we show how this can be
done. Procedures for the forward and inverse fast cosine transforms
are presented for easy implementation on the computer.

Finally, an appendix is included that presents a method and
associated flowgraphs for efficient computation of the DFT of
a real sequence and the IDFT of a Hermitian symmetric sequence.
The flowgraphs are believed to be novel.

II.  Discrete  Cosine  Transform 

To motivate the derivation of the DCT presented below, we
shall first review some basic discrete-time Fourier theory.

Let x(n) be a discrete-time signal and X(ω) its Fourier transform.
In one definition the cosine transform of x(n) is the
real part of X(ω). The real part of X(ω) is also equal to the
Fourier transform of the even part of x(n), defined by x_e(n) =
[x(n) + x(-n)]/2 (see [8], for example). Therefore, the
cosine transform of x(n) is equal to the Fourier transform of
x_e(n). Now, if x(n) is causal, i.e., x(n) = 0 for n < 0, then
x_e(n) and, therefore, the cosine transform uniquely specifies
x(n). In that case, x_e(n) can be thought of as an even extension
of x(n). Therefore, the cosine transform of a causal x(n)
can be obtained as the Fourier transform of an even extension
of x(n). This viewpoint forms the basis of the derivation of
the DCT below.

As an example, Fig. 1(a) shows a causal signal x(n), and Fig.
1(b) shows an even extension of x(n), y1(n) = x(n) + x(-n).
[y1(n) is equal to twice the even part of x(n).] Another possible
even extension of x(n) is shown in Fig. 1(c), where
y2(n) = x(n) + x(-n - 1). y1(n) is even about the point n = 0
while y2(n) is even about n = -0.5. The Fourier transform of
y1(n) is real, while the Fourier transform of y2(n) contains a
linear-phase term corresponding to the half-sample offset.
Cosine transforms based on y1(n) or y2(n) can be defined and
from which x(n) can be determined uniquely.

In the example above we assumed that the Fourier transform
is computed at all frequencies. In practice, one usually computes
the discrete Fourier transform (DFT) at a finite number
of equally spaced frequencies. For this case, the signal can be
recovered in its aliased periodic form from the DFT [8]. In
attempting to form even extensions of a signal where the extended
signals are constrained to be periodic, one has additional
choices in defining those extensions. As an example,
let x(n), 0 ≤ n ≤ N - 1, be the sequence given by the four
nonzero samples in Fig. 1(a) (i.e., N = 4). Fig. 2 shows four
different even extensions of x(n). y1(n) is a (2N - 2)-point
even extension; y2(n) and y3(n) are two different (2N - 1)-point
even extensions; and y4(n) is a 2N-point even extension.
Each of the four extension definitions could form the basis
for a DCT definition. The most common form of the DCT is
the one derived from the 2N-point even extension y4(n) and
is the one discussed in this paper.

Fig. 1. (a) Causal signal x(n). (b) An even extension of x(n), y1(n),
about n = 0. (c) Another even extension of x(n), y2(n), about
n = -0.5.

Fig. 2. Four different periodic even extensions of the nonzero x(n)
samples in Fig. 1(a). (a) y1(n) is a (2N - 2)-point even extension.
(b) and (c) y2(n) and y3(n) are two different (2N - 1)-point even
extensions. (d) y4(n) is a 2N-point even extension. The DCT discussed
in this paper is derived from the 2N-point even extension y4(n).

A. Forward DCT

We desire the DCT of an N-point real data sequence x(n),
0 ≤ n ≤ N - 1. The DCT is derived below from the DFT of
a 2N-point even extension of x(n).

Let y(n) be a 2N-point even extension of x(n) defined by:

$$ y(n) = \begin{cases} x(n), & 0 \le n \le N-1 \\ x(2N-n-1), & N \le n \le 2N-1 \end{cases} \tag{1} $$

so that

$$ y(2N-n-1) = y(n). \tag{2} $$

Fig. 3(a) shows an example of a signal x(n), and Fig. 3(b)
shows the corresponding even extension of x(n) as defined in
(1). Because of the minus 1 on the left-hand side of (2), y(n)
is not even about N and, therefore, will not have a real DFT, as
we shall see below.

Fig. 3. (a) Original signal x(n), 0 ≤ n ≤ N - 1. (b) A 2N-point even
extension of x(n), y(n). (c) Division of y(n) into its even and odd
parts v(n) and w(n).

The DFT of y(n) is given by

$$ Y(k) = \sum_{n=0}^{2N-1} y(n)\, W_{2N}^{nk} \tag{3} $$

where

$$ W_M = e^{-j2\pi/M}. $$

Substituting (1) in (3), we have:

$$ Y(k) = \sum_{n=0}^{N-1} x(n)\, W_{2N}^{nk} + \sum_{n=N}^{2N-1} x(2N-n-1)\, W_{2N}^{nk}. \tag{4} $$

By changing the summation variable in the right-hand term,
noting that $W_{2N}^{2mN} = 1$ for integer m, and factoring out $W_{2N}^{-k/2}$,
we have

$$ Y(k) = W_{2N}^{-k/2} \sum_{n=0}^{N-1} x(n) \left[ W_{2N}^{(n+1/2)k} + W_{2N}^{-(n+1/2)k} \right]. \tag{5} $$

The expression in (5) may be written in two ways:

$$ Y(k) = W_{2N}^{-k/2}\; 2 \sum_{n=0}^{N-1} x(n) \cos\frac{\pi(2n+1)k}{2N}, \quad 0 \le k \le 2N-1 \tag{6} $$

$$ Y(k) = W_{2N}^{-k/2}\; 2\,\mathrm{Re}\left[ W_{2N}^{k/2} \sum_{n=0}^{N-1} x(n)\, W_{2N}^{nk} \right], \quad 0 \le k \le 2N-1. \tag{7} $$

By defining the DCT of x(n) as³

$$ C(k) = 2 \sum_{n=0}^{N-1} x(n) \cos\frac{\pi(2n+1)k}{2N}, \quad 0 \le k \le N-1, \tag{8} $$

we have, from (6) and (8),

$$ Y(k) = W_{2N}^{-k/2}\, C(k) \tag{9a} $$

or

$$ C(k) = W_{2N}^{k/2}\, Y(k) \tag{9b} $$

and, from (7) and (9a),

$$ C(k) = 2\,\mathrm{Re}\left[ W_{2N}^{k/2} \sum_{n=0}^{N-1} x(n)\, W_{2N}^{nk} \right]. \tag{10} $$

Equation (9) specifies the relationship between the DCT of a
sequence and the DFT of the 2N-point extension of that
sequence. Note that C(k) is real and Y(k) is complex. Y(k)
would have been equal to C(k) had the sequence y(n) been
delayed by half a sample, in which case y(n) would have been
truly even.

Therefore, the DCT of x(n) may be computed by taking the
2N-point DFT of y(n), as in (3), and multiplying the result by
$W_{2N}^{k/2}$, as in (9b). From (10) we see that the DCT may also be
obtained by taking the 2N-point DFT of the original sequence
x(n) with N zeros appended to it, multiplying the result by
$W_{2N}^{k/2}$, then taking twice the real part. The latter method has
been the one in common usage [4].

³The DCT definition here is slightly different from other definitions
[4], mainly in the relative amplitude of C(0) to that of other terms.
Also, we do not use an orthonormalizing factor. The range on k is the
same here as in the literature; however, in this paper we shall also make
use of C(N), which, from (8), is equal to zero always since the cosine
term is zero for k = N.

B. Inverse DCT

Again, we shall derive the inverse DCT (IDCT) from the
inverse DFT (IDFT). The IDFT of Y(k) is given by

$$ y(n) = \frac{1}{2N} \sum_{k=0}^{2N-1} Y(k)\, W_{2N}^{-nk}. \tag{11} $$

Since y(n) is real, Y(k) is Hermitian symmetric:

$$ Y(2N-k) = Y^*(k). \tag{12} $$

Furthermore, from (7), it is simple to show that

$$ Y(N) = 0. \tag{13} $$

Using (12) and (13) in (11), one can show that

$$ y(n) = \frac{1}{N}\,\mathrm{Re}\left[ \frac{Y(0)}{2} + \sum_{k=1}^{N-1} Y(k)\, W_{2N}^{-nk} \right], \quad 0 \le n \le 2N-1. \tag{14} $$

Substituting (9a) in (14) and using (1), we have the desired
IDCT

$$ x(n) = \frac{1}{N}\left[ \frac{C(0)}{2} + \sum_{k=1}^{N-1} C(k) \cos\frac{\pi(2n+1)k}{2N} \right], \quad 0 \le n \le N-1. \tag{15} $$

Equations (8) and (15) form a DCT pair. Given C(k), x(n) is
retrieved by first computing Y(k) using (9a), then taking the
2N-point complex IDFT implied by (14), which results in
y(n) and, hence, x(n).

III. Fast Cosine Transform (FCT)

A. Forward FCT

We now show that the DCT may be obtained from the N-point
DFT of a real sequence instead of a 2N-point DFT,
resulting in a saving of 1/2.

Divide the sequence y(n) into two N-point sequences

$$ v(n) = y(2n), \qquad w(n) = y(2n+1), \qquad 0 \le n \le N-1 \tag{16} $$

where v(n) and w(n) are the sets of even and odd points in
y(n), respectively. Fig. 3(c) shows how this division takes
place in the given example. Note that each of v(n) and w(n)
contains all the original samples of x(n), and that w(n) is
simply the reverse sequence of v(n). In fact, from (2) and
(16), one can show that

$$ w(n) = v(N-n-1), \quad 0 \le n \le N-1. \tag{17} $$

Substituting (16) in (3), we have

$$ Y(k) = \sum_{n=0}^{N-1} v(n)\, W_{2N}^{2nk} + \sum_{n=0}^{N-1} w(n)\, W_{2N}^{(2n+1)k}. \tag{18} $$

Substituting (17) in (18), noting that $W_{2N}^{2nk} = W_N^{nk}$, rearranging
terms, and using (9), one can show that

$$ C(k) = 2\,\mathrm{Re}\left[ W_{4N}^{k} \sum_{n=0}^{N-1} v(n)\, W_N^{nk} \right], \quad 0 \le k \le N-1. \tag{19a} $$
The difference between (19a) and (10) is the use of $W_N$ instead
of $W_{2N}$ in the summation, and v(n) instead of x(n).
The result is that one can now compute the DCT from an
N-point DFT instead of a 2N-point DFT. Equation (19a) may
be rewritten as

$$ C(k) = 2 \sum_{n=0}^{N-1} v(n) \cos\frac{\pi(4n+1)k}{2N}, \quad 0 \le k \le N-1 \tag{19b} $$

which gives an alternate definition of the DCT in terms of the
reordered sequence v(n) [compare (19b) with (8)].

The sequence v(n) can be written directly in terms of x(n):

$$ v(n) = \begin{cases} x(2n), & 0 \le n \le \left[\dfrac{N-1}{2}\right] \\[2ex] x(2N-2n-1), & \left[\dfrac{N+1}{2}\right] \le n \le N-1 \end{cases} \tag{20} $$

where [a] denotes "integer part of a." Therefore, v(n) is obtained
by taking the even points in x(n) in order, followed by
the odd points in their reverse order. Note that (20) applies
for any value of N, odd or even.

The specification of the FCT is now complete; the computational
procedure is given in Section III-B. We simply note
here that since v(n) is real, its N-point DFT can be computed
from the (N/2)-point DFT of a complex sequence (see the
Appendix).
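As a numerical check on the forward FCT, the following sketch computes the DCT three ways: directly from (8), from the 2N-point DFT of the even extension via (1), (3), and (9b), and from the N-point DFT of the reordered sequence via (20) and (19a). It is an illustration only; the function names are hypothetical, and a naive O(N²) DFT stands in for an FFT so that the example needs nothing beyond the standard library.

```python
import cmath
import math


def dft(seq):
    """Naive DFT: X(k) = sum_n seq[n] * exp(-j*2*pi*n*k/len(seq))."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]


def dct_direct(x):
    """DCT from the definition in (8), with no orthonormalizing factor."""
    n = len(x)
    return [2 * sum(x[m] * math.cos(math.pi * (2 * m + 1) * k / (2 * n))
                    for m in range(n)) for k in range(n)]


def dct_even_extension(x):
    """DCT from the 2N-point DFT of the even extension, (1), (3), (9b)."""
    n = len(x)
    y = list(x) + list(reversed(x))          # y(n) of (1)
    big = dft(y)                             # Y(k) of (3)
    # W_{2N}^{k/2} = exp(-j*pi*k/(2N)); the product is real up to rounding.
    return [(cmath.exp(-1j * math.pi * k / (2 * n)) * big[k]).real
            for k in range(n)]


def reorder(x):
    """v(n) of (20): even points in order, then odd points reversed."""
    n = len(x)
    return [x[2 * m] if 2 * m < n else x[2 * n - 2 * m - 1] for m in range(n)]


def fct(x):
    """DCT from the N-point DFT of v(n), as in (19a)."""
    n = len(x)
    big_v = dft(reorder(x))
    return [2 * (cmath.exp(-2j * math.pi * k / (4 * n)) * big_v[k]).real
            for k in range(n)]
```

All three routines agree to rounding error for both odd and even N, which is the point made above about (20) applying for any value of N.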

B. Inverse FCT

The idea here is to compute v(n) from the DCT first, then use
(20) to obtain x(n). Substituting v(n) = y(2n) in (14), we
have

$$ v(n) = \frac{1}{N}\,\mathrm{Re}\left[ \frac{Y(0)}{2} + \sum_{k=1}^{N-1} Y(k)\, W_N^{-nk} \right], \quad 0 \le n \le N-1. \tag{21} $$

Equation (21) indicates that v(n) can be computed using an
N-point complex IDFT instead of the 2N-point complex IDFT
implied by (14). However, the number of computations is
still about twice that used in the forward FCT. We now show
that the inverse FCT can, in fact, be computed with the same
number of computations as the forward FCT. The method is
to compute V(k) from C(k), then compute the IDFT of V(k)
to obtain v(n).

Equation (19a) can be rewritten as

$$ C(k) = \mathrm{Re}\left[ 2 W_{4N}^{k}\, V(k) \right] \tag{22} $$

where V(k) is the DFT of v(n). To compute V(k) from C(k)
in (22) we need also a knowledge of the imaginary part of the
term in brackets. Denote the imaginary part by $C_I(k)$ and the
whole complex number by $C_c(k)$, where

$$ C_c(k) = C(k) + jC_I(k) = 2 W_{4N}^{k}\, V(k), \tag{23} $$

then

$$ V(k) = \tfrac{1}{2} W_{4N}^{-k}\, C_c(k). \tag{24} $$

We first need to determine $C_I(k)$. Using the fact that V(k)
is Hermitian symmetric

$$ V(N-k) = V^*(k), \tag{25} $$

one can show, using (23), that

$$ C_c(N-k) = -jC_c^*(k) = -\left[ C_I(k) + jC(k) \right]. \tag{26} $$

From (23) and (26), we conclude that

$$ C_I(k) = -C(N-k) $$

and

$$ C_c(k) = C(k) - jC(N-k) = 2 W_{4N}^{k}\, V(k). \tag{27} $$

(Note that one can take advantage of (27) in (23) for computing
C(k) since one can compute C(k) and C(N - k) simultaneously.)
From (27), we have

$$ V(k) = \tfrac{1}{2} W_{4N}^{-k} \left[ C(k) - jC(N-k) \right], \quad 0 \le k \le N-1. \tag{28} $$

V(k) is computed from (28) for 0 ≤ k ≤ N/2, then use (25)
for k > N/2. In computing V(0), one needs the value of
C(N), which, from (9b) and (13), is seen easily to be equal
to zero

$$ C(N) = 0. \tag{29} $$

After computing V(k), v(n) is obtained as the IDFT

$$ v(n) = \frac{1}{N} \sum_{k=0}^{N-1} V(k)\, W_N^{-nk}. \tag{30} $$

It would seem that (30) again requires an N-point complex
IDFT. However, in the Appendix we show how the IDFT
of a Hermitian symmetric sequence can be computed using
an (N/2)-point complex IDFT, the same as in the forward
FCT.

We are now ready to specify the complete procedure for
computing the FCT and the inverse FCT.

FCT Procedure: Given a real sequence x(n), 0 ≤ n ≤ N - 1.

1) Form the sequence v(n) from (20).

2) Compute V(k), 0 ≤ k ≤ N - 1, the DFT of v(n).

3) Multiply V(k) by 2 exp(-jπk/2N). From (27), we see
that the real part will be C(k) and the negative of the imaginary
part will be C(N - k). Therefore, the value of k is varied
in the range 0 ≤ k ≤ [N/2].

IFCT Procedure: Given the DCT C(k), 0 ≤ k ≤ N - 1, and
C(N) = 0.

1) Compute V(k) from (28).

2) Compute the IDFT of V(k), v(n).

3) Retrieve x(n) from v(n) using (20).
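The two procedures above can be sketched as follows, and a round trip through both should return the original sequence. This is a minimal illustration under stated assumptions: the helper names `w`, `fct`, and `ifct` are hypothetical, the DFT and IDFT are evaluated naively rather than with the half-length methods of the Appendix, and V(k) is computed from (28) for all k instead of using the symmetry in (25).

```python
import cmath
import math


def w(m, p):
    """W_m^p = exp(-j*2*pi*p/m)."""
    return cmath.exp(-2j * math.pi * p / m)


def fct(x):
    """FCT procedure: steps 1) to 3), taking the real part for every k."""
    n = len(x)
    v = [x[2 * m] if 2 * m < n else x[2 * n - 2 * m - 1] for m in range(n)]  # (20)
    big_v = [sum(v[m] * w(n, m * k) for m in range(n)) for k in range(n)]
    return [(2 * w(4 * n, k) * big_v[k]).real for k in range(n)]            # (19a)


def ifct(c):
    """IFCT procedure: V(k) from (28), IDFT as in (30), then undo (20)."""
    n = len(c)
    cx = list(c) + [0.0]                              # C(N) = 0, as in (29)
    big_v = [0.5 * w(4 * n, -k) * (cx[k] - 1j * cx[n - k]) for k in range(n)]
    v = [sum(big_v[k] * w(n, -m * k) for k in range(n)).real / n
         for m in range(n)]
    x = [0.0] * n
    for m in range(n):
        if 2 * m < n:
            x[2 * m] = v[m]                           # even points, in order
        else:
            x[2 * n - 2 * m - 1] = v[m]               # odd points, reversed
    return x
```

Because step 1) of the inverse uses C(k) and C(N - k) together, the sketch appends C(N) = 0 so that V(0) can be formed without a special case.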

C. Computational Considerations

Since most researchers have some form of the DFT available
on their computers, the FCT and IFCT procedures given in
the previous section can be implemented easily. Furthermore,
the procedure is quite general and may be used for any value
of N. We have also seen that the N-point DFT of v(n) and the
N-point IDFT of V(k) can be computed from the (N/2)-point
DFT and (N/2)-point IDFT, respectively, of some complex
sequence. For a highly composite value of N, one can, of
course, use the FFT to great advantage. Maximum savings
accrue when N is a power of 2. In the latter case, if one makes
use of the fact that the sequence is real, the total number of
computations for the FCT or the IFCT is on the order of
N log2 N, the same as in [1]. The major difference here is
that we do not require a specialized algorithm.

In computing C(k) from (27) one first takes the N-point
DFT of v(n), and therefore one needs a table of sines or
cosines where the unit circle is divided into N equal segments.
However, multiplying afterwards by $W_{4N}^{k}$ requires a table
where the unit circle is divided into 4N segments. Since k is
in the range 0 ≤ k ≤ N - 1, the N values of sines and cosines
are all in the first quadrant. This point is made to emphasize
the fact that the DCT of a sequence of length N requires an
exponential table four times as large as that required for an
N-point DFT.

IV. Two-Dimensional Fast Cosine Transform

In this section we present results analogous to the 1-D case
given in Sections II and III. We show how the DCT of a 2-D
real sequence {x(n1,n2), 0 ≤ n1 ≤ N1 - 1, 0 ≤ n2 ≤ N2 - 1}
can be computed using an (N1 X N2)-point real DFT instead
of the (2N1 X 2N2)-point real DFT required in the traditional
method, resulting in a saving of 1/4. Since the methods used
here are similar to the 1-D case, no detailed derivations will be
given.

A. Two-Dimensional DCT and IDCT

In a manner analogous to the 1-D case in (1), define a
(2N1 X 2N2)-point even extension of x(n1,n2) in the n1
and n2 directions:

$$ y(n_1,n_2) = \begin{cases}
x(n_1,n_2); & 0 \le n_1 \le N_1-1,\ 0 \le n_2 \le N_2-1 \\
y(2N_1-n_1-1,\ n_2); & N_1 \le n_1 \le 2N_1-1,\ 0 \le n_2 \le N_2-1 \\
y(n_1,\ 2N_2-n_2-1); & 0 \le n_1 \le N_1-1,\ N_2 \le n_2 \le 2N_2-1 \\
y(2N_1-n_1-1,\ 2N_2-n_2-1); & N_1 \le n_1 \le 2N_1-1,\ N_2 \le n_2 \le 2N_2-1.
\end{cases} \tag{31} $$

Fig. 4 shows an example where N1 = 3 and N2 = 4; the numbers
in the figure are the sample values. Note that the number
of samples in y(n1,n2) is 4N1N2, i.e., four times that in
x(n1,n2).⁴ The 2-D DFT of y(n1,n2) is defined by

$$ Y(k_1,k_2) = \sum_{n_1=0}^{2N_1-1} \sum_{n_2=0}^{2N_2-1} y(n_1,n_2)\, W_{2N_1}^{n_1 k_1}\, W_{2N_2}^{n_2 k_2}. \tag{32} $$

From (31) and (32), one can show that

$$ Y(k_1,k_2) = W_{2N_1}^{-k_1/2}\, W_{2N_2}^{-k_2/2}\, C(k_1,k_2) \tag{33} $$

where

$$ C(k_1,k_2) = 4 \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1,n_2) \cos\frac{\pi(2n_1+1)k_1}{2N_1} \cos\frac{\pi(2n_2+1)k_2}{2N_2}, $$
$$ 0 \le k_1 \le N_1-1,\quad 0 \le k_2 \le N_2-1 \tag{34} $$

is the 2-D DCT of the sequence x(n1,n2). The computation
of C(k1,k2) from either (32) and (33) or from (34) requires
one to take the DFT of a (2N1 X 2N2)-point real sequence.

Fig. 4. An example of a 2-D (N1 X N2)-point sequence and its (2N1 X
2N2)-point even extension. Here, N1 = 3 and N2 = 4. The sample
values are given numerically in the figure. The four sequences defined
in (39) in the text are indicated in the figure as follows: filled circles,
v(n1,n2); triangles, w1(n1,n2); squares, w2(n1,n2); open circles,
w3(n1,n2).

From (32)-(34), one can show that Y(k1,k2) has the following
properties:

$$ \begin{aligned}
Y(2N_1-k_1,\ 2N_2-k_2) &= Y^*(k_1,k_2) \\
Y(2N_1-k_1,\ 0) &= Y^*(k_1,0) \\
Y(0,\ 2N_2-k_2) &= Y^*(0,k_2) \\
Y(N_1,k_2) = Y(k_1,N_2) &= 0.
\end{aligned} \tag{35} $$

The first three equations constitute the Hermitian symmetric
properties in 2-D, and they are derivable from (32) for real
y(n1,n2). The last equation in (35) is analogous to (13) in
1-D, and is a consequence of the particular type of even symmetry
of y(n1,n2). Substituting (35) in the equation for the
2-D IDFT

$$ y(n_1,n_2) = \frac{1}{4N_1N_2} \sum_{k_1=0}^{2N_1-1} \sum_{k_2=0}^{2N_2-1} Y(k_1,k_2)\, W_{2N_1}^{-n_1 k_1}\, W_{2N_2}^{-n_2 k_2} \tag{36} $$

one can show that the 2-D IDCT of C(k1,k2) is given by

$$ x(n_1,n_2) = \frac{1}{N_1N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} C'(k_1,k_2) \cos\frac{\pi(2n_1+1)k_1}{2N_1} \cos\frac{\pi(2n_2+1)k_2}{2N_2} \tag{37} $$

where

$$ C'(k_1,k_2) = \begin{cases}
C(0,0)/4, & k_1 = 0,\ k_2 = 0 \\
C(k_1,0)/2, & k_1 \ne 0,\ k_2 = 0 \\
C(0,k_2)/2, & k_1 = 0,\ k_2 \ne 0 \\
C(k_1,k_2), & k_1 \ne 0,\ k_2 \ne 0.
\end{cases} \tag{38} $$

x(n1,n2) may be computed using a 2-D (2N1 X 2N2)-point
IDFT.

⁴In the m-D case, the number of samples in the extended sequence
y(n1, ···, nm) is 2^m times the number of samples in x(n1, ···, nm).


5.  Two-Dimensional  FCT  and  1FCT 

The corresponding fast algorithms in the 2-D case are obtained
in a manner analogous to the 1-D case. We divide
$y(n_1, n_2)$ into four $(N_1 \times N_2)$-point sequences, starting with
the four points at (0, 0), (1, 0), (0, 1), and (1, 1), respectively,
and taking every second point after that.$^5$ The four sequences
are then given by

$$
\begin{aligned}
v(n_1, n_2) &= y(2n_1,\; 2n_2) \\
w_1(n_1, n_2) &= y(2n_1 + 1,\; 2n_2) \\
w_2(n_1, n_2) &= y(2n_1,\; 2n_2 + 1) \\
w_3(n_1, n_2) &= y(2n_1 + 1,\; 2n_2 + 1)
\end{aligned}
\tag{39}
$$

where $0 \le n_1 \le N_1 - 1$ and $0 \le n_2 \le N_2 - 1$. In the example
in Fig. 4 note how each of the sequences in (39) contains all
of the samples in $x(n_1, n_2)$, but in a reordered fashion. From
(39) and (31), one can show that

$$
\begin{aligned}
w_1(n_1, n_2) &= v(N_1 - n_1 - 1,\; n_2) \\
w_2(n_1, n_2) &= v(n_1,\; N_2 - n_2 - 1) \\
w_3(n_1, n_2) &= v(N_1 - n_1 - 1,\; N_2 - n_2 - 1).
\end{aligned}
\tag{40}
$$

Let $V(k_1, k_2)$ be the 2-D $(N_1 \times N_2)$-point DFT of $v(n_1, n_2)$:

$$
V(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1}
v(n_1, n_2)\, W_{N_1}^{n_1 k_1}\, W_{N_2}^{n_2 k_2}.
\tag{41}
$$

Then, by substituting (40) and (39) in (32), and using (33),
one can show that

$$
C(k_1, k_2) = 2\,\mathrm{Re}\left\{ W_{4N_1}^{k_1}
\left[ W_{4N_2}^{k_2} V(k_1, k_2) + W_{4N_2}^{-k_2} V(k_1, N_2 - k_2) \right] \right\}
\tag{42a}
$$

or equivalently,

$$
C(k_1, k_2) = 2\,\mathrm{Re}\left\{ W_{4N_2}^{k_2}
\left[ W_{4N_1}^{k_1} V(k_1, k_2) + W_{4N_1}^{-k_1} V(N_1 - k_1, k_2) \right] \right\}.
\tag{42b}
$$

From (42), it is simple to show that

$$
C(k_1, k_2) = 4 \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} v(n_1, n_2)\,
\cos\frac{\pi (4n_1 + 1) k_1}{2N_1}\,
\cos\frac{\pi (4n_2 + 1) k_2}{2N_2}.
\tag{43}
$$

It is clear from (42) and (41) that $C(k_1, k_2)$ can be computed
using an $(N_1 \times N_2)$-point DFT instead of a $(2N_1 \times 2N_2)$-point
DFT, resulting in a saving of 1/4.

The only remaining step is how to obtain $v(n_1, n_2)$ directly
from $x(n_1, n_2)$ instead of using (39). One can show that

$$
v(n_1, n_2) =
\begin{cases}
x(2n_1,\; 2n_2), & 0 \le n_1 \le \lfloor (N_1 - 1)/2 \rfloor,\; 0 \le n_2 \le \lfloor (N_2 - 1)/2 \rfloor \\
x(2N_1 - 2n_1 - 1,\; 2n_2), & \lfloor (N_1 + 1)/2 \rfloor \le n_1 \le N_1 - 1,\; 0 \le n_2 \le \lfloor (N_2 - 1)/2 \rfloor \\
x(2n_1,\; 2N_2 - 2n_2 - 1), & 0 \le n_1 \le \lfloor (N_1 - 1)/2 \rfloor,\; \lfloor (N_2 + 1)/2 \rfloor \le n_2 \le N_2 - 1 \\
x(2N_1 - 2n_1 - 1,\; 2N_2 - 2n_2 - 1), & \lfloor (N_1 + 1)/2 \rfloor \le n_1 \le N_1 - 1,\; \lfloor (N_2 + 1)/2 \rfloor \le n_2 \le N_2 - 1.
\end{cases}
\tag{44}
$$

$^5$ In the m-D case, $y(n_1, \ldots, n_m)$ is divided into $2^m$ sequences,
starting with each of the $2^m$ corners of the unit m-D cube, from
$(0, 0, \ldots, 0)$ to $(1, 1, \ldots, 1)$, and taking every second point after that.
The sequence that begins at $(i_1, i_2, \ldots, i_m)$, where each $i_j = 0$ or 1, is
defined by $y(2n_1 + i_1, 2n_2 + i_2, \ldots, 2n_m + i_m)$.
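The forward 2-D FCT described above (reorder the input as in (44), take one (N1 x N2)-point DFT, then combine as in (42a)) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: it assumes the convention W_N = exp(-j*2*pi/N), uses a naive DFT where an FFT routine would normally be called, and checks the result against a directly evaluated DCT with an overall factor of 4. All function names are hypothetical.

```python
import cmath
import math

def w(period, k):
    """Twiddle factor W_period^k, assuming W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * math.pi * k / period)

def reorder(x):
    """Build v from x: even-indexed samples first, odd-indexed ones reversed."""
    rows, cols = len(x), len(x[0])

    def src(n, size):
        return 2 * n if 2 * n < size else 2 * size - 2 * n - 1

    return [[x[src(n1, rows)][src(n2, cols)] for n2 in range(cols)]
            for n1 in range(rows)]

def dft2(v):
    """Naive (N1 x N2)-point 2-D DFT; an FFT would replace this in practice."""
    rows, cols = len(v), len(v[0])
    return [[sum(v[n1][n2] * w(rows, n1 * k1) * w(cols, n2 * k2)
                 for n1 in range(rows) for n2 in range(cols))
             for k2 in range(cols)]
            for k1 in range(rows)]

def fct2(x):
    """2-D DCT computed from a single DFT of the reordered input."""
    rows, cols = len(x), len(x[0])
    big_v = dft2(reorder(x))
    c = [[0.0] * cols for _ in range(rows)]
    for k1 in range(rows):
        for k2 in range(cols):
            # V is periodic, so V(k1, N2) is read as V(k1, 0).
            inner = (w(4 * cols, k2) * big_v[k1][k2]
                     + w(4 * cols, -k2) * big_v[k1][(cols - k2) % cols])
            c[k1][k2] = 2.0 * (w(4 * rows, k1) * inner).real
    return c

def dct2_direct(x):
    """Reference 2-D DCT evaluated from the cosine sums (factor of 4 assumed)."""
    rows, cols = len(x), len(x[0])
    return [[4.0 * sum(x[n1][n2]
                       * math.cos(math.pi * (2 * n1 + 1) * k1 / (2 * rows))
                       * math.cos(math.pi * (2 * n2 + 1) * k2 / (2 * cols))
                       for n1 in range(rows) for n2 in range(cols))
             for k2 in range(cols)]
            for k1 in range(rows)]

# A 4 x 3 example, so one dimension is even and the other odd.
x = [[1.0, 2.0, -1.0], [0.5, 3.0, 4.0], [2.0, -2.0, 1.0], [0.0, 1.0, 5.0]]
fast = fct2(x)
slow = dct2_direct(x)
```

The two results should agree to within rounding error for both even and odd dimensions, since the index map in `reorder` covers either case.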


The 2-D IFCT of $C(k_1, k_2)$ is obtained by first computing
$V(k_1, k_2)$ from $C(k_1, k_2)$. In a manner analogous to the 1-D
case, we define a complex quantity $C_c(k_1, k_2)$ by not taking
the real part in (42a), and then show that

$$
C_c(k_1, k_2) = C(k_1, k_2) - j\,C(N_1 - k_1, k_2).
\tag{45}
$$

Find $C_c^*(N_1 - k_1, N_2 - k_2)$, then add and subtract the result
from $C_c(k_1, k_2)$. The answer can be shown to reduce to

$$
V(k_1, k_2) = \tfrac{1}{4}\, W_{4N_1}^{-k_1} W_{4N_2}^{-k_2}
\left\{ \left[ C(k_1, k_2) - C(N_1 - k_1, N_2 - k_2) \right]
- j \left[ C(N_1 - k_1, k_2) + C(k_1, N_2 - k_2) \right] \right\}
\tag{46}
$$

where $0 \le k_1 \le N_1 - 1$ and $0 \le k_2 \le N_2 - 1$. However, since
$V(k_1, k_2)$ is Hermitian symmetric, one need compute only
half the values in (46). In performing the computations, one
needs the fact that

$$
C(N_1, k_2) = C(k_1, N_2) = 0, \quad \text{all } k_1 \text{ and } k_2,
\tag{47}
$$

which can be shown to be true from (33) and (35), or from
(43).

After $V(k_1, k_2)$ is computed from (46) and (47), $v(n_1, n_2)$
is evaluated from the IDFT

$$
v(n_1, n_2) = \frac{1}{N_1 N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1}
V(k_1, k_2)\, W_{N_1}^{-n_1 k_1}\, W_{N_2}^{-n_2 k_2}.
\tag{48}
$$

The sequence $x(n_1, n_2)$ is then retrieved from $v(n_1, n_2)$ by
using (44).
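The inverse procedure just described (form V from C using (46) and (47), apply the IDFT of (48), then undo the reordering of (44)) can be sketched in Python as below. The sketch assumes W_N = exp(-j*2*pi/N) and a forward DCT with an overall factor of 4; it uses a naive inverse DFT in place of an IFFT and computes every value of V rather than exploiting Hermitian symmetry. The names `ifct2`, `src`, and `dct2_direct` are illustrative only.

```python
import cmath
import math

def w(period, k):
    """Twiddle factor W_period^k, assuming W_N = exp(-j*2*pi/N)."""
    return cmath.exp(-2j * math.pi * k / period)

def src(n, size):
    """Index map of the reordering: v(n) = x(src(n)) along one dimension."""
    return 2 * n if 2 * n < size else 2 * size - 2 * n - 1

def ifct2(c):
    """Recover x from its 2-D DCT using one (N1 x N2)-point inverse DFT."""
    rows, cols = len(c), len(c[0])

    def c_at(k1, k2):
        # C vanishes on the lines k1 = N1 and k2 = N2.
        return 0.0 if k1 == rows or k2 == cols else c[k1][k2]

    # Step 1: rebuild V(k1, k2) from the real DCT coefficients.
    big_v = [[0.25 * w(4 * rows, -k1) * w(4 * cols, -k2)
              * complex(c_at(k1, k2) - c_at(rows - k1, cols - k2),
                        -(c_at(rows - k1, k2) + c_at(k1, cols - k2)))
              for k2 in range(cols)]
             for k1 in range(rows)]

    # Step 2: inverse DFT gives the reordered sequence v (real up to rounding).
    v = [[sum(big_v[k1][k2] * w(rows, -n1 * k1) * w(cols, -n2 * k2)
              for k1 in range(rows) for k2 in range(cols)).real / (rows * cols)
          for n2 in range(cols)]
         for n1 in range(rows)]

    # Step 3: undo the reordering to obtain x.
    x = [[0.0] * cols for _ in range(rows)]
    for n1 in range(rows):
        for n2 in range(cols):
            x[src(n1, rows)][src(n2, cols)] = v[n1][n2]
    return x

def dct2_direct(x):
    """Reference forward 2-D DCT (factor of 4 assumed), used only for the check."""
    rows, cols = len(x), len(x[0])
    return [[4.0 * sum(x[n1][n2]
                       * math.cos(math.pi * (2 * n1 + 1) * k1 / (2 * rows))
                       * math.cos(math.pi * (2 * n2 + 1) * k2 / (2 * cols))
                       for n1 in range(rows) for n2 in range(cols))
             for k2 in range(cols)]
            for k1 in range(rows)]

x = [[1.0, 2.0, -1.0], [0.5, 3.0, 4.0], [2.0, -2.0, 1.0], [0.0, 1.0, 5.0]]
x_back = ifct2(dct2_direct(x))
```

`x_back` should match `x` to within rounding error.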


V.  Conclusion 

We showed how the DCT of an $N$-point sequence may be
derived from the DFT of a $2N$-point even extension of the
given sequence. Then, we presented a fast algorithm (FCT),
first developed by Narasimha and Peterson [7], which allows
for the computation of the DCT of a sequence from just an
$N$-point DFT of a reordered version of the same sequence,
with a resulting saving of 1/2. Therefore, one can use existing
FFT software to compute the DCT. For $N$ a power of 2, the
number of computations is comparable to that reported by
Chen et al. [1], who used a specialized algorithm.

The FCT algorithm in this paper is more general than that of
[7] in that $N$ may be odd or even. Also, an algorithm was
developed here for computing the inverse FCT using the same
number of computations as in the forward method.

The method was then extended to the 2-D case, where a
saving of 1/4 was achieved. The method can be generalized
to compute an m-D DCT using an m-D DFT, with a saving
of $1/2^m$ over traditional methods that use the DFT.


Appendix 

DFT and IDFT for a Real Sequence

Let $v(n)$, $0 \le n \le N - 1$, be a real sequence, where $N$ is divisible
by 2. We wish to compute the DFT of $v(n)$, $V(k)$, $0 \le k \le N - 1$,
using an $N/2$-point DFT. Except for the flowgraphs
in Fig. 5, the procedure given below is well known (see,
for example, [6]).


DFT  Procedure 


1) Place the even and odd points of $v(n)$ in the real and
imaginary parts, respectively, of a complex vector $t(n) =
t_R(n) + j\,t_I(n)$, where

$$
\left.
\begin{aligned}
t_R(n) &= v(2n) \\
t_I(n) &= v(2n + 1)
\end{aligned}
\right\} \quad 0 \le n \le \frac{N}{2} - 1.
\tag{A-1}
$$

2) Compute the $N/2$-point DFT of $t(n)$, $T(k)$, $0 \le k \le
N/2 - 1$.

3) Compute $V(k)$ from $T(k)$ using the formula [5]

$$
V(k) = \tfrac{1}{2}\left[ T(k) + T^*(N/2 - k) \right]
- \tfrac{j}{2}\, W_N^{k} \left[ T(k) - T^*(N/2 - k) \right].
\tag{A-2}
$$

Fig. 5. (a) Supplementary flowgraph for computing the FFT of a real
sequence. The computation in the figure is performed for $0 \le k \le
[N/4]$. (b) Supplementary flowgraph for computing the IFFT of a
Hermitian symmetric sequence. The computation is performed for
$0 \le k \le [N/4]$.
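The three-step DFT procedure can be sketched in Python as follows. The sketch assumes the convention W_N = exp(-j*2*pi/N) and uses the standard even/odd split identity V(k) = (1/2)[T(k) + T*(N/2 - k)] - (j/2) W_N^k [T(k) - T*(N/2 - k)], with T taken as periodic in N/2. A naive DFT stands in for the N/2-point FFT, and all N output values are computed directly instead of two at a time as in the flowgraph. The function names are illustrative.

```python
import cmath
import math

def dft(seq):
    """Naive DFT; stands in for an FFT routine."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * math.pi * m * k / n) for m in range(n))
            for k in range(n)]

def real_dft(v):
    """DFT of a real, even-length sequence from one half-length complex DFT."""
    n = len(v)
    if n % 2:
        raise ValueError("sequence length must be divisible by 2")
    half = n // 2
    # Step 1: even samples in the real part, odd samples in the imaginary part.
    t = [complex(v[2 * m], v[2 * m + 1]) for m in range(half)]
    # Step 2: half-length complex DFT.
    big_t = dft(t)
    # Step 3: separate the two interleaved spectra and recombine.
    out = []
    for k in range(n):
        tk = big_t[k % half]
        tc = big_t[(half - k) % half].conjugate()   # T*(N/2 - k)
        wk = cmath.exp(-2j * math.pi * k / n)       # W_N^k
        out.append(0.5 * (tk + tc) - 0.5j * wk * (tk - tc))
    return out

v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fast = real_dft(v)
slow = dft(v)
```

`fast` and `slow` should agree to within rounding error; `fast[0]` is the sum of the samples.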

The computations in the last step can be made more efficient.
From (A-2), one can write

$$
V^*(N/2 - k) = \tfrac{1}{2}\left[ T(k) + T^*(N/2 - k) \right]
+ \tfrac{j}{2}\, W_N^{k} \left[ T(k) - T^*(N/2 - k) \right].
\tag{A-3}
$$

Given $T(k)$, Fig. 5(a) shows the flowgraph [5] that implements
(A-2) and (A-3) to compute $V(k)$. Note that the values
of $V(k)$ are computed two at a time. Therefore, the value
of $k$ in Fig. 5(a) should range between $0 \le k \le [N/4]$. The
other values of $V(k)$, $k > N/2$, may be obtained by noting
that $V(k)$ is Hermitian symmetric. There are two points in
$V(k)$ that are real and require no multiplication. They are

$$
\begin{aligned}
V(0) &= \mathrm{Re}[T(0)] + \mathrm{Im}[T(0)] \\
V(N/2) &= \mathrm{Re}[T(0)] - \mathrm{Im}[T(0)].
\end{aligned}
\tag{A-4}
$$

Also, if $N$ is divisible by 4, one can show that

$$
V(N/4) = T^*(N/4).
\tag{A-5}
$$

Given a Hermitian symmetric $V(k)$, $0 \le k \le N - 1$, we wish
to compute the IDFT, $v(n)$, $0 \le n \le N - 1$. From (A-2) and
(A-3), one can easily solve for $T(k)$ and $T^*(N/2 - k)$. The
resulting equations can be implemented using the flowgraph
in Fig. 5(b). The IDFT procedure is, then, as shown in the
following.

IDFT Procedure

1) Compute $T(k)$ from $V(k)$ using the flowgraph in Fig.
5(b), where the range of $k$ in the figure is $0 \le k \le [N/4]$.

2) Compute the $(N/2)$-point IDFT of $T(k)$, $t(n)$, $0 \le n \le
(N/2) - 1$.

3) $v(n)$ is obtained from $t(n)$ by using (A-1).

Finally, if $N$ is not divisible by 2, one can compute the DFT
of two separate sequences using the same method given above
[6].

ABSTRACT

In  embedded  multirate  speech  coding,  it  is 
desired  to  have  a  transmitter/receiver  system  that 
operates  efficiently  over  a  wide  range  of  channel 
transmission  rates.  We  are  currently  investigating 
a  system  based  on  adaptive  transform  coding  of 
speech,  in  which  we  code  and  transmit  the  system 
parameters  and  the  discrete  cosine  transform  (DCT) 
coefficients  of  the  fullband  linear  prediction 
residual  waveform.  The  multirate  property  of  the 
system  is  achieved  by  allowing  the  channel  to 
discard  some  of  the  bits  generated  by  the  high  data 
rate  transmitter.  Stripping  off  bits  results  in  an 
absence  of  DCT  components,  which  the  receiver 
regenerates by a spectral duplication method. An
inverse  DCT  at  the  receiver  yields  the  time  domain 
residual  waveform  to  be  used  as  input  to  the  linear 
prediction  synthesis  filter.  The  lowest  data  rate 
achievable  by  the  system  is  about  2.5  kb/s,  in 
which  case  the  system  reduces  to  a  narrowband  LPC 
pitch-excited  vocoder. 

1.  INTRODUCTION 

In  multirate  speech  coding  schemes,  it  is 
desired  to  have  a  transmitter/receiver  system  that 
can  operate  over  a  wide  range  of  channel 
transmission  rates.  With  the  added  constraint  of 
embedded  coding,  the  channel  is  allowed  to  discard 
some  of  the  transmitted  bits  at  each  frame  to 
achieve  lower  data  rates.  Such  a  system  lends 
itself  well  to  a  packet-switched  communication 
network  (such  as  the  ARPANET),  where  traffic 
congestion  would  be  alleviated  by  lowering  the  data 
rate. 

In  this  paper  we  describe  a  hybrid 
linear-prediction  transform-coding  system  that  can 
operate  in  the  range  2500-16000  b/s,  with  inputs 
bandlimited to 3.33 kHz and sampled at 6.67 kHz.

2. SYSTEM DESCRIPTION

Taking linear prediction (LP) as the basic
method of spectral envelope representation, we code
and transmit (i) the system parameters (LPC
coefficients, gain, pitch, and pitch coefficient),
and (ii) the discrete cosine transform (DCT)
coefficients of the LP residual, using a modified
adaptive transform coding (ATC) scheme [1]. A
block diagram of the system is shown in Fig. 1.

At each frame, the codes representing the
system parameters are transmitted first. These
codes are then followed by the codes representing
the DCT components. The transmitter always
operates at the same fixed bit rate; it is the
channel's function to discard some of the
transmitted bits to alleviate traffic congestion.
Thus, the maximum data rate of 16,000 b/s is
achieved when the DCT of the fullband residual is
transmitted. The minimum data rate achievable by
the system takes place when all the codes
representing the DCT components are suppressed by
the channel and only the system parameters are
transmitted. In such a case, the receiver becomes
identical to that of a narrowband pitch-excited LPC
vocoder operating at 2500 b/s.

Fig. 1 Block diagram of multirate speech transform
coder. Panel (b): Receiver.

Intermediate data rates are achieved by
stripping off bits from the high data rate system.
Stripping off bits results in the suppression of
low-amplitude frequency components and/or
high-frequency components. This aspect of the
system will be explained in Section 4. Here, we
point out that, at the intermediate data rates, the
receiver regenerates the missing frequency
components to restore the fullband DCT. An inverse
DCT yields the time-domain residual waveform which
is used as input to the all-pole LP synthesis
filter.

It is worth mentioning here that the
transmitter itself can strip off bits prior to
transmission in the same manner as is done by the
channel. Thus, when a system is first turned on,
the initial bit rate need not be high and traffic
congestion can be avoided. Note that the receiver
does not need to know where in the system the bits
were discarded.

One salient feature of the above described
system is that it is a frequency-domain baseband
coder. It has the flexibility that the width of
the baseband can vary in time, depending on the
number of bits retained by the channel at each
frame, thus accommodating various bit rates without
affecting the analysis done at the transmitter and
without the use of lowpass filters. Also, the
advantage of transmitting the DCT of the residual
is that the latter has a flat spectral envelope;
it lends itself well to our previously developed
methods of high-frequency regeneration by spectral
duplication [2]. Once a flat-spectrum excitation
signal is derived from the received baseband, the
synthesis filter yields an output spectrum that is
close to the spectrum of the input signal.

The LP synthesis filter also helps smooth the
frame-boundary discontinuities caused by
frequency-domain quantization (time-domain
aliasing). In fact, unlike ordinary ATC schemes,
we have found that no overlap between frames is
necessary when we use the DCT of the residual.

Finally, a further advantage of the
frequency-domain approach is the gain in
signal-to-noise ratio afforded by ATC over PCM of
speech. We have shown previously [1] that taking
the DCT of the residual, appropriately, rather than
that of the speech signal achieves the same gain in
signal-to-noise ratio.

3. TRANSFORM CODING

ATC of speech has been described in detail in
the literature [3,4]. Here we review it briefly,
mainly to point out the places where our
implementation differs from that of others. An
important step in ATC, bit allocation, requires a
spectral model for the speech signal. The model we
are currently using is given by

$$
|H(\omega)| = \frac{1}{|A(\omega)|\,|P(\omega)|}
\tag{1}
$$

in which $|A(\omega)|$ is the magnitude of the DFT of a
9th order LP inverse filter derived from the speech
signal, and $|P(\omega)|$ is the magnitude of the DFT of a
one-tap pitch inverse filter derived from the LP
residual. Prior to bit allocation, the spectral
model given in (1) is weighted to achieve a noise
spectrum with a desired envelope given by
$1/|A(\omega)|^{2\gamma}$, as is done in [4]. The bit allocation
process is in fact quantization of the weighted
spectral model of the speech signal on a
logarithmic scale. It can be described by means of
the following equations:

$$
\hat{b}_i = b_0 + \left[ 20 \log_{10} H_i \right] / S, \quad 1 \le i \le N
\tag{2}
$$

and

$$
b_i = \max\{0,\, [\hat{b}_i + \delta]\}
\tag{3}
$$

such that

$$
\sum_{i=1}^{N} b_i = B
\tag{4}
$$

where $H_i$ represents the value of $|H(\omega)|$ at the
discrete set of frequencies $\omega_i$, $b_0$ is the average
number of bits per sample ($B/N$), $N$ is the number of
DCT points, $B$ is the total number of available bits
per frame, $S$ is the quantization step size in
decibels, $\hat{b}_i$ is the (fractional) allocated
codelength in bits at frequency $\omega_i$, and $b_i$ is the
integer codelength corresponding to $\hat{b}_i$. In (3),
the symbol $[\cdot]$ denotes taking the integer value
nearest to the argument and the $\max(\cdot)$ function is
used to prevent the allocation of negative
codelengths. The adjustment constant $\delta$ in (3) is
varied iteratively until the condition in (4) is
satisfied.
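A minimal Python sketch of this kind of bit allocation is given below, under several assumptions that are not stated in the text: the logarithm in (2) is applied to the weighted model value H_i alone, codelengths are capped at a hypothetical `max_bits` of 5, and the adjustment constant is found by bisection. When no constant gives exactly B bits (for example because of ties), the sketch returns the largest allocation that does not exceed B.

```python
import math

def allocate_bits(h, total_bits, step_db=5.0, max_bits=5):
    """Integer codelengths from a positive spectral model h (one value per DCT point)."""
    n = len(h)
    b0 = total_bits / n                       # average bits per sample, B/N
    # Fractional codelengths: log spectrum quantized in steps of step_db decibels.
    frac = [b0 + 20.0 * math.log10(hi) / step_db for hi in h]

    def integer_bits(delta):
        # Nearest integer, clipped to the range [0, max_bits].
        return [min(max_bits, max(0, math.floor(f + delta + 0.5))) for f in frac]

    lo = -max(frac) - 1.0                     # every codelength is 0 here
    hi = max_bits - min(frac) + 1.0           # every codelength is max_bits here
    if sum(integer_bits(hi)) <= total_bits:
        return integer_bits(hi)
    for _ in range(60):                       # bisection on the adjustment constant
        mid = 0.5 * (lo + hi)
        if sum(integer_bits(mid)) <= total_bits:
            lo = mid
        else:
            hi = mid
    return integer_bits(lo)

# Hypothetical 8-point model that falls off by 6 dB per point.
model = [8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625]
bits = allocate_bits(model, total_bits=16)
```

Because the total is non-decreasing in the adjustment constant, bisection is sufficient; a production coder would more likely step the constant in the same way at transmitter and receiver so that both arrive at identical allocations.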

An important aspect of the quantization scheme
given in (2) and (3) is the choice of step size $S$.
Traditionally, the "6 dB per bit" rule has been
used, i.e., $S = 6.02$ dB. However, the increase in
signal-to-quantization-noise ratio for the optimal
$b_i$-bit Max quantizers [5] used in quantizing the
DCT coefficients ranges from 4.4 dB to 5.8 dB for
bits in the range $0 \le b_i \le 5$. Therefore, ideally, one
should perform the bit allocation with unequal
quantization steps. For simplicity, we have kept
the uniform quantization in (2) with a compromise
value of $S = 5$ dB, and found this scheme to yield a
higher signal-to-noise ratio than the case with $S = 6$
dB.

In the system we are presenting here, we
perform the bit allocation over the total signal
bandwidth and transmit the DCT components of the
fullband residual. The initial quantization
accuracy is determined mainly by the total number
of bits used at each frame, the step size $S$, and
the value of $\gamma$ for noise shaping. However, the
accuracy and/or bandwidth of the received DCT
components is further affected by the number of
bits discarded by the channel. This last point is
the subject of the next section.

4.  EMBEDDED  CODING 

As mentioned in Section 2, the transmitter
transmits at each frame a block of bits, which is
divided into two major parts. The first part
contains the bits representing the system
parameters and the second part contains the bits
representing the DCT components. It is assumed
that the channel strips off bits starting at the
end of a block, thereby discarding bits that
represent DCT components. The codes representing
the DCT components are arranged in a certain order
prior to transmission. This ordering determines
which bits get discarded. To study the tradeoff
between the number of transmitted bits, the
quantization accuracy, and the number of received
frequency components, we investigated three
ordering techniques. In all three techniques, to
be described below, we assume that the receiver
decodes the system parameters and performs the bit
allocation as was done at the transmitter. This is
standard practice in ATC. In addition, we require
that the receiver know how many bits are received
each frame so that it knows where the next frame
begins. This last piece of information is passed
along by the channel itself.

The first bit-ordering technique we
investigated is the simplest: the codes are
arranged by order of increasing frequency. When
the channel strips off bits from the end of each
block, the high-frequency components are discarded
first. The remaining codes represent a low
frequency portion of the total bandwidth referred
to as a baseband. As in a baseband coder, the
receiver regenerates the missing high-frequency
components. We use the method of high-frequency
regeneration (HFR) by spectral duplication, which
is explained in Section 5.

In the second ordering technique, the DCT
codes are broken down into individual bits. The
bits are then grouped together by order of
significance, with the most significant bits in the
first group and the least significant bits in the
last group. To explain this grouping method, let
us assume that the bit allocation produces
codelengths between 1 and 5 bits. Therefore, the
receiver expects to see 5 groups of bits. The
first group contains the most significant bit of
all 5-bit codes (in order of increasing frequency).
The second group of bits contains the next most
significant bit of all 5-bit codes and the most
significant bit of all 4-bit codes. And so on, all
the way to the last (5th) group of bits which
contains the least significant bit of all the
codes. When the channel strips off bits, the first
bits to be discarded are the least significant
bits of each transmitted code, resulting in
decreased quantization accuracy. For example, a
DCT component originally coded into 3 bits will be
decoded by means of the 2-bit decoding table if its
least significant bit has been dropped.

In the third ordering technique, the codes are
grouped together according to their length. Thus,
the receiver expects to see 5 groups of codes, with
the first group containing all 5-bit codes, the
second group all 4-bit codes, and so on, with the
last group containing all 1-bit codes.
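The three orderings can be illustrated with a short Python sketch. It represents each DCT code as a string of bits listed in order of increasing frequency, which is a convenience chosen here rather than anything specified in the paper, and it assumes that every group in the second and third techniques is internally ordered by increasing frequency (the text states this explicitly only for the first group). The function names are hypothetical.

```python
def order_by_frequency(codes):
    """Technique 1: whole codes in order of increasing frequency."""
    return "".join(codes)

def order_by_significance(codes):
    """Technique 2: bit planes, most significant first, codes aligned at the LSB."""
    longest = max(len(code) for code in codes)
    stream = []
    for group in range(longest):
        for code in codes:
            # A code of length L first appears in group (longest - L).
            position = group - (longest - len(code))
            if position >= 0:
                stream.append(code[position])
    return "".join(stream)

def order_by_length(codes):
    """Technique 3: whole codes grouped by decreasing codelength."""
    longest = max(len(code) for code in codes)
    return "".join(code
                   for length in range(longest, 0, -1)
                   for code in codes if len(code) == length)

# Four DCT components with codelengths 3, 1, 2, and 3 bits.
codes = ["101", "1", "01", "110"]
by_frequency = order_by_frequency(codes)
by_significance = order_by_significance(codes)
by_length = order_by_length(codes)
```

Stripping bits from the end of each stream (for example `by_frequency[:6]`) then discards, respectively, the highest-frequency codes, the least significant bits of all codes, or the shortest codes.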

The  second  and  third  ordering  techniques 
described  above  are  more  complex  than  the  first, 
since  they  require  arranging  the  data  in  the  order 
of  decreasing  codelength.  Informal  listening  tests 
have  shown  that  the  quality  of  the  reconstructed 
speech  when  using  the  baseband-coder  approach  (the 
first  ordering  technique)  is  superior  to  that 
obtained  by  the  second  and  third  techniques.  In 
general,  our  experience  has  been  that  the  details 
of  the  low-frequency  components  of  speech  are 
perceptually  more  important  than  those  at  high 
frequencies.  Thus,  the  task  is  to  find  a  good 
compromise,  at  a  given  bit  rate,  between  baseband 
width  and  quantization  accuracy.  At  present,  we 
feel  that  the  first  technique  is  giving  the  best 
overall  speech  quality  for  bit  rates  in  the  range 
6.4  to  9.6  kb/s.  We  have  not  compared  the  three 
techniques  at  bit-rates  above  9.6  kb/s.  However, 
it is worthwhile pointing out here that we are
seeking one single technique that will perform
uniformly  well  over  the  whole  range  of  data  rates, 
because  we  are  excluding  the  possibility  of 
changing  the  coding  algorithm  while  the  system  is 
in  operation. 

5. HIGH FREQUENCY REGENERATION

We have presented elsewhere [2] new HFR
methods based on duplication of the baseband
spectrum. In particular, we presented time-domain
systems that perform spectral duplication by
spectral folding and by spectral translation. We
also presented a frequency-domain system [1] that
performs HFR by spectral translation of the
baseband. The principle of the method is based on
the fact that, in the adaptive transform baseband
coder, the baseband DCT components can be easily
duplicated at higher frequencies to obtain the
fullband excitation signal. Our present HFR method
aims at duplicating, as closely as possible, the
original fullband DCT of the residual, while
accommodating the variable baseband width aspect of
the present multirate system. The method is as
follows.

The transmitter assumes a (fixed) nominal
baseband width of 1000 Hz. Thus, the simplest
spectral translation method would be to duplicate
the region from 0 to 1000 Hz onto the regions from
1000 to 2000 Hz and from 2000 to 3000 Hz. In
addition, to lock the high-frequency interval into
place, by exploiting the quasi-harmonic structure
of the speech spectrum, we shift the baseband
around its nominal position and correlate it with
the corresponding original DCT components that are
in the region 1000 to 2000 Hz. The
cross-correlation is done at the transmitter where
the original DCT of the fullband residual is
available. The same process is repeated for the
next frequency band. Short lags from -3 to +4
spectral points are considered. (The total
bandwidth is 128 points.) The optimal location is
then chosen to be at the positive maximum value of
the cross-correlation. Thus, we require an
additional 3 bits of side information for each of
the two high-frequency bands. The additional 6
bits are transmitted along with the system
parameters.

At the receiver, the decoded baseband is
translated up, starting at 1000 Hz (and 2000 Hz for
the next band) and is moved further by a small
amount as indicated by the 3-bit HFR code.
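The lag search at the transmitter can be sketched in Python as below. This is a simplified illustration: it takes the baseband DCT values and the original fullband DCT as plain lists, tries the eight lags from -3 to +4 around a nominal starting index, and returns the lag with the largest cross-correlation together with a 3-bit code. The mapping of lags to codes (lag + 3) and the function name are assumptions made here, and the sketch omits the refinements mentioned in the text, such as excluding the components below half the pitch frequency.

```python
def best_hfr_lag(baseband, band_start, original_dct, lags=range(-3, 5)):
    """Pick the shift of the baseband copy that best matches the original DCT."""
    best_lag, best_corr = None, 0.0
    for lag in lags:
        corr = 0.0
        for i, value in enumerate(baseband):
            index = band_start + lag + i
            if 0 <= index < len(original_dct):
                corr += value * original_dct[index]
        if best_lag is None or corr > best_corr:
            best_lag, best_corr = lag, corr
    # Eight candidate lags fit in a 3-bit code (assumed mapping: code = lag + 3).
    return best_lag, best_lag - lags[0]

# Synthetic example: the high band repeats the baseband pattern 2 points
# above the nominal starting index of 40.
baseband = [1.0, -2.0, 3.0, -1.0, 2.0, 0.5, -3.0, 1.5]
original = [0.0] * 128
for i, value in enumerate(baseband):
    original[42 + i] = value
lag, code = best_hfr_lag(baseband, 40, original)
```

At the receiver, the decoded baseband would be copied to `band_start + lag`, where `lag` is recovered from the transmitted code.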

In  practice,  there  are  three  deviations  from 
the  simple  algorithm  described  above.  The  first  is 
that  the  first  few  DCT  components,  starting  at  d.c. 
and  up  to  half  the  pitch  frequency,  are  not 
duplicated  onto  the  high-frequency  bands,  nor  are 
they  considered  in  the  correlation  method  described 
above.  Second,  we  found  that  spectral  flattening 
of  the  baseband  at  the  receiver  prior  to  HFR 
improves  the  speech  quality.  The  third  deviation 
from  the  algorithm  is  due  to  the  fact  that  the 
received  baseband  is  seldom  equal  to  1000  Hz.  In 
fact,  it  varies  from  frame  to  frame.  We  have 
devised  certain  modifications  to  the  method  to  deal 
with  that  problem  appropriately. 


6.  RESULTS 


We performed a large number of experiments to
study the tradeoffs between all the interacting
aspects of the system described in this paper. We
summarize our results briefly below. The basic
system is governed by an already existing ARPANET
LPC vocoder operating with a frame size of 19.2 ms,
i.e., 128 points, and transmitting 9
log-area-ratios each frame coded into
5, 5, 5, 4, 4, 4, 3, 3, and 3 bits, respectively.
Together with pitch and gain, the system parameters
require a total of 48 bits. Thus, the basic LPC
vocoder operates at a bit-rate of 2500 b/s. In
addition, we require 4 bits for the pitch tap and 6
bits for the HFR codes, bringing the side
information to 58 bits per frame. The remaining
bits, which determine the total bit rate, are used
to code the DCT coefficients. We investigated the
following aspects of the system.

a. Initial Quantization Accuracy. To study the
interaction between quantization accuracy and
the number of received DCT components at a given
data rate, we coded the fullband DCT at an
average of 2, 2.25, 2.5, and 3 bits per sample.
In our experiments, there was no maximum limit
on the bit rate for the fullband case, although,
in practice, the system will be operated at 9.6
kb/s or below.

b. Noise Shaping. We used various values of $\gamma$
ranging between 0 and 1. Clearly, for $\gamma$ closer
to 1, the available bits are spread more evenly
in the frequency range, resulting in a larger
number of received DCT components at a given bit
rate, at the expense of coarser quantization in
the low-frequency region for voiced sounds.

c. Embedded Coding. We simulated the three
ordering techniques described in Section 4 and
evaluated informally the speech quality obtained
with each. For each technique, we optimized the
system in terms of the total fullband rate and
the value of $\gamma$ as in (a) and (b) above.

d. High-Frequency Regeneration. As described in
Section 5, the transmitter assumes a nominal
baseband width for which it computes the HFR
codes. We investigated the use of 3-bit and
2-bit HFR codes, and the use of a nominal
baseband width of 800, 900 and 1000 Hz.

For each choice of the above described
parameter settings we tested the system at 9.6
kb/s, 7.2 kb/s and 6.4 kb/s, although, in
principle, the data rate can be set by the channel
to an arbitrary value. All tests were done with 5
male and 5 female sentences. Our present choice of
a good compromise system operating uniformly well
over the desired data rates is one where we code
the fullband DCT of the residual at an average of
1.95 bits per sample, i.e., where the maximum bit
rate is 16 kb/s. The value of $\gamma$ is 0.9, and the
embedded coding technique is the first (baseband
coding), with a nominal baseband width of 1000 Hz
and 3-bit HFR codes. The data rates of 9.6, 7.2
and 6.4 kb/s are achieved by keeping at each frame
124, 80, and 65 bits, respectively, out of the
maximum total of 250 bits. The average width of
the received baseband for the three cases is 1400,
870, and 670 Hz, respectively.
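The frame-level figures quoted in this section are tied together by simple arithmetic, which the following Python sketch makes explicit. It uses only numbers stated in the text (19.2 ms frames, 128 points, 48 bits for the basic LPC parameters, 4 + 6 additional side-information bits, and a maximum of 250 DCT bits per frame); the variable and function names are illustrative.

```python
FRAME_SECONDS = 0.0192        # 19.2 ms frame
POINTS_PER_FRAME = 128        # DCT block size
LPC_BITS = 48                 # log-area-ratios, pitch, and gain
SIDE_BITS = LPC_BITS + 4 + 6  # plus pitch tap and HFR codes
MAX_DCT_BITS = 250            # DCT bits per frame at the full rate

def bits_to_rate(bits_per_frame):
    """Convert a per-frame bit count to a bit rate in b/s."""
    return bits_per_frame / FRAME_SECONDS

sampling_rate_hz = POINTS_PER_FRAME / FRAME_SECONDS   # about 6.67 kHz
vocoder_rate = bits_to_rate(LPC_BITS)                 # basic LPC vocoder
full_rate = bits_to_rate(SIDE_BITS + MAX_DCT_BITS)    # about 16 kb/s
bits_per_sample = MAX_DCT_BITS / POINTS_PER_FRAME     # about 1.95
```

The same conversion can be used in reverse to estimate how many DCT bits per frame remain once the 58 side-information bits are subtracted at a given channel rate.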

At present, we feel that the above described
system is providing us with very good speech
quality at 9.6 kb/s, good speech quality at 7.2
kb/s, and reasonable quality at 6.4 kb/s. The
problem at bit rates of 6.4 kb/s or below is that
the received baseband becomes too narrow, which
results in appreciable roughness in the coded
speech and some "thuds." Also noticeable at 6.4
kb/s, especially for female voices, is the
reverberant quality of the reconstructed speech.

7.  CONCLUSION 

We presented in this paper a simple and
effective embedded-code multirate speech transform
coder. In going to a frequency-domain approach, we
feel that we have accrued several advantages over
time-domain coding schemes. Some of these
advantages are: the ease with which spectral
noise-shaping and quantization accuracy can be
controlled, the ease with which the code can be
embedded for multirate operation, and the ease with
which high-frequency regeneration can be done for
effective voice excitation. Our plans for the
future include testing the use of a 3-tap pitch
predictor, going to a 256-point block size, i.e.,
grouping two 19.2 ms frames together for transform
coding purposes, and, in general, improving the
speech quality at 6.4 kb/s. Finally, we also plan
to implement the system on the FPS array processor
AP-120B for real-time operation.

ABSTRACT

We report on the initial development
of  a  phonetic  vocoder  operating  at  100 
b/s.  With  each  phoneme,  the  vocoder 
transmits  the  duration  and  a  single  pitch 
value.  The  synthesizer  uses  a  large 
inventory  of  diphone  templates  to 
synthesize  a  desired  phoneme  string.  To 
determine  a  phoneme  string  from  input 
speech,  the  analyzer  takes  into  account 
the  synthesis  model  by  using  the  same 
inventory  of  diphone  templates,  augmented 
by  additional  diphone  templates  to  account 
for  alternate  pronunciations.  The  phoneme 
string  is  chosen  to  minimize  the 
difference  between  the  diphone  templates 
and  the  input  speech  according  to  a 
distance  measure. 

1.  INTRODUCTION 

There  are  many  applications  for 
digital  speech  transmission  and  speech 
playback  that  require  very  low  data  rates 
on the order of 100 b/s. In strategic
communications, having a very-low-rate
(VLR) capability would allow low power
communications to avoid detection.
Alternatively, it would allow sufficient
power per bit to reliably "burn through"
jamming networks. Under extreme
atmospheric conditions or in highly
shielded environments, only VLR
transmission may be possible. The same
technology  also  supports  speech  storage 
for  later  playback  using  space  comparable 
to  that  required  for  text.  This  would 
permit  the  storage  of  much  larger  amounts 
of  speech  in  small  devices  than  previously 
possible.  A  VLR  vocoder  would  also  allow 
the  transmission  of  spoken  messages  using 
much  the  same  mechanism  now  used  for 
computer  mail  systems,  without  requiring 
high  data  rate  connections  to  the  computer 
and  large  amounts  of  file  storage. 

In  this  paper,  we  first  describe  a 
study  performed  to  investigate  the 
feasibility  of  a  vocoder  that  operates  at 
75-100  b/s.  After  briefly  discussing  the 
results  of  this  study,  we  report  on  the 
status  of  a  current  project  involving  the 
implementation  of  a  VLR  vocoder. 


2.  FEASIBILITY  STUDY 

In  order  to  assess  the  feasibility  of 
a  phonetic  vocoder,  we  undertook,  in  1976, 
a  short  study  in  which  we  approximated  the 
conditions  of  a  75-100  b/s  phonetic 
vocoder  [1] . 

The  first  question  we  asked  ourselves 
was:  "What  should  be  the  transmission 
units  of  a  VLR  vocoder?"  We  argued  then 
that  the  transmission  unit  must  be  on  the 
order  of  a  phoneme.  At  an  average 
speaking  rate  of  about  12  phonemes  per 
second,  simply  transmitting  phonemes 
requires  about  60-75  b/s.  Adding  just
the  barest  amount  of  intonation 
information  (3  bits  total  for  each  phoneme 
duration  and  a  single  pitch  value)  brings 
the  rate  up  to  100  b/s.  Even  if  the 
transmission  units  used  by  such  a  vocoder 
were  not  phonemes  per  se,  the  spectral 
difference  between  any  two  units  would 
necessarily  be  on  the  order  of  the 
difference  between  phonemes.  Hence, 
errors  in  such  a  vocoder  would  directly 
cause  a  loss  of  intelligibility,  since 
changing  a  phoneme  could  change  the 
meaning. 
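
The rate estimate above can be checked with a few lines of arithmetic. This is a minimal sketch: the speaking rate and the 3 prosody bits per phoneme come from the text, while the figure of 5 to 6 bits to identify one phoneme (an inventory of 32 to 64 symbols) is an assumption made here to reproduce the quoted range.

```python
def vocoder_bit_rate(phonemes_per_second, bits_per_phoneme, prosody_bits_per_phoneme):
    """Average bit rate in b/s for a phoneme-based vocoder."""
    return phonemes_per_second * (bits_per_phoneme + prosody_bits_per_phoneme)


SPEAKING_RATE = 12  # phonemes per second, from the text
PROSODY_BITS = 3    # duration plus one pitch value per phoneme, from the text

# Assumption: 5 to 6 bits are enough to name one phoneme.
for identity_bits in (5, 6):
    identity_only = vocoder_bit_rate(SPEAKING_RATE, identity_bits, 0)
    with_prosody = vocoder_bit_rate(SPEAKING_RATE, identity_bits, PROSODY_BITS)
    print(identity_bits, identity_only, with_prosody)
```

Under these assumptions the phoneme identities alone cost 60 to 72 b/s, and adding the prosody bits gives 96 to 108 b/s, which is consistent with the figure of about 100 b/s.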

Fig.  1  Very  low  rate  (VLR)  Phonetic
Vocoder.

The  basic  phonetic  vocoder  that  we
implemented  is  shown  in  Fig.  1.  Using  the
Acoustic-Phonetic  Recognizer  (APR) 
component  of  the  HWIM  Speech  Understanding 
System  [2]  as  the  phoneme  analyzer,  we
produced  a  set  of  phoneme  strings  with 
associated  durations  and  pitch  values  to 
be  transmitted  to  a  phonetic  synthesizer. 
The  transmitted  pitch  values  specify  a 
piecewise  linear  function  (linear  during 
each  phoneme).  The  function  is  determined
by  minimizing  the  weighted  squared  error 
between  the  linear  function  and  the 
original  pitch  track.  The  synthesizer 
used  was  Dennis  Klatt's  synthesis-by-rule 
program  [3].  The  test  was  not  performed
under  optimal  conditions;  neither  the
analyzer  nor  the  synthesizer  was  designed
for  phonetic  vocoding,  nor  were  they 
completely  compatible  with  each  other. 
However,  we  felt  that  the  results  derived 
would  still  be  useful. 
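
The pitch coding step described above, a piecewise linear function fitted to the original pitch track by weighted least squares, might look like the following sketch. The knot positions, the weights, and the function name are hypothetical; the text does not say where the breakpoints sit or how the weights were chosen. The sketch assumes at least two knots and at least one positively weighted frame near every knot.

```python
def fit_pitch_knots(times, pitch, weights, knots):
    """Weighted least-squares knot values for a continuous piecewise-linear track."""
    n = len(knots)
    diag = [0.0] * n        # main diagonal of the normal equations
    off = [0.0] * (n - 1)   # off[i] couples knot i and knot i + 1
    rhs = [0.0] * n
    for t, y, w in zip(times, pitch, weights):
        # Find the knot interval that contains t (clamped at both ends).
        i = 0
        while i < n - 2 and t > knots[i + 1]:
            i += 1
        span = knots[i + 1] - knots[i]
        a = min(1.0, max(0.0, (t - knots[i]) / span))  # share of the right knot
        diag[i] += w * (1 - a) ** 2
        diag[i + 1] += w * a ** 2
        off[i] += w * a * (1 - a)
        rhs[i] += w * (1 - a) * y
        rhs[i + 1] += w * a * y
    # Solve the symmetric tridiagonal system (Thomas algorithm).
    for i in range(1, n):
        m = off[i - 1] / diag[i - 1]
        diag[i] -= m * off[i - 1]
        rhs[i] -= m * rhs[i - 1]
    values = [0.0] * n
    values[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        values[i] = (rhs[i] - off[i] * values[i + 1]) / diag[i]
    return values
```

A natural choice of weight would be small for frames with unreliable pitch estimates, so that they influence the transmitted values less; that choice is also an assumption.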

To  start,  we  chose  a  subset  of 
sentences  on  which  the  APR  performed 
better  than  average,  with  phoneme 
recognition  rates  ranging  from  65%-95%.  A
two-way  conversation,  in  which  one  side 
was  vocoded  by  this  simulation  system  and 
the  other  was  a  natural  voice,  was  played 
to  a  panel  of  listeners.  The  panel  was 
asked  to  transcribe  the  vocoded  side  of 
the  conversation.  The  primary  result  was 
that  sentences  in  which  80%  or  more  of  the 
phonemes  were  correctly  recognized  by  the 
APR  were  transcribed  with  little 
difficulty  by  the  panel.  Those  sentences 
that  contained  more  errors  were  largely
unintelligible.  A  second  important  result 
was  that  a  very  rough  phoneme  duration, 
and  a  heavily  quantized  single  pitch  value 
for  each  phoneme,  were  sufficient  to 
preserve  the  original  intonation. 

We  concluded,  therefore,  that  for  a 
phonetic  vocoder  operating  at  100  b/s  to 
be  useful,  it  must  have  a  phonetic 
recognition  rate  of  at  least  80%  and 
natural  phonetic  synthesis.  Below,  we 
describe  our  first  effort  at  designing 
such  a  system. 

3.  DIPHONE  VOCODER 

We  have  recently  begun  working  on  a 
phonetic  vocoder  that  is  designed  to 
transmit  intelligible  speech  at  an  average 
rate  of  about  100  b/s.  The  block  diagram
for  this  vocoder  is  the  same  as  for  the
experimental  system  shown  in  Fig.  1.  The
vocoder  extracts  and  transmits  a  sequence 
of  phonemes.  It  also  transmits  the 
phoneme  durations  and  a  single  pitch  value 
for  each  voiced  phoneme  in  order  to 
preserve  the  intonation  in  the  input 
speech.  However,  the  phonetic  analysis 
and  synthesis  components  of  this  vocoder 
have  been  changed  from  those  of  the
experimental  system.  The  basic  unit  that
we  have  chosen  for  use  in  both  the
analyzer  and  synthesizer  is  the  diphone.
A  diphone  is  defined  as  the  region  from 
the  middle  of  one  phoneme  to  the  middle  of 
the  next  phoneme.  The  diphone  is  a 
natural  unit  for  synthesis  because  the 
coarticulatory  influence  of  one  phoneme
does  not  usually  extend  much  further  than 
half  way  into  the  next  phoneme.  Since 
diphone  junctures  are  often  at 
articulatory  steady  states,  minimal 
smoothing  is  required  between  adjacent 
diphones.  Also,  the  difficult  task  of 
representing  phoneme  transitions  by 
acoustic-phonetic  rules  is  avoided. 
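
To make the diphone definition concrete, the sketch below turns a phoneme string into the sequence of diphones that spans it. The silence symbol used to pad both ends and the phoneme labels in the example are illustrative assumptions, not the inventory used in this work.

```python
def phonemes_to_diphones(phonemes, silence="SIL"):
    """Pair each phoneme with its successor, padding the utterance with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return list(zip(padded, padded[1:]))


# Hypothetical labels for a three-phoneme word.
print(phonemes_to_diphones(["K", "AE", "T"]))
```

An utterance of N phonemes therefore needs N + 1 diphones, each covering the second half of one phoneme and the first half of the next.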

Below,  we  describe  the  diphone 
synthesis  and  diphone  analysis  components 
of  the  phonetic  vocoder.  At  the  present 
time,  both  programs  are  being  designed  for 
a  single  speaker. 

3.1  Diphone  Synthesis 

Since  the  design  of  the  analyzer 
depends  critically  on  the  synthesizer 
model,  we  shall  discuss  the  synthesizer 
first.  Fig.  2  shows  a  block  diagram  of
the  synthesis  program.

Fig.  2  Diphone  Synthesis  method.

The  phonetic
synthesizer  [4]  uses  the  transmitted 
phonemes,  durations  and  pitch  values,  to 
produce  a  sequence  of  control  parameters 
(LPC  parameters,  voicing,  pitch,  gain, 
cutoff  frequency  [5,6])  for  an  LPC
synthesizer.  These  parameters  are  stored
in  a  large  inventory  of  templates  -  one
for  each  diphone.  A  template  contains, 
for  each  10  ms  frame,  the  LPC  parameters 
and  a  value  of  gain.  The  LPC  parameters 
are  stored  as  Log  Area  Ratio  (LAR) 
parameters,  since  these  are  used  directly 
by  the  diphone  synthesis  program.  The 
diphone  inventory  has  been  selected  to 
account  for  many  of  the  stronger 
coarticulation  effects  in  English.  In 
particular,  the  diphone  template  inventory 
is  chosen  to  differentiate  between 
prevocalic  and  postvocalic  allophones  of 
sonorants,  to  account  for  changes  in  vowel 
color  conditioned  by  postvocalic  liquids, 
to  allow  exact  specification  of  voice 
onset  time,  and  to  permit  the  synthesis  of 
glottal  stops,  alveolar  flaps,  and 
syllabic  consonants.  The  diphone 
templates  are  extracted  from  a  carefully 
constructed  set  of  short  utterances.  The 
current  total  diphone  synthesis  inventory 
is  approximately  2650  diphones. 
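
As an illustration of the template contents described above, the sketch below stores one frame as a vector of log area ratios plus a gain. The class and function names are hypothetical. The LAR formula is the standard one, but its sign convention varies between authors, so the convention used here is an assumption.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TemplateFrame:
    """One 10 ms frame of a diphone template (hypothetical layout)."""
    lar: List[float]  # log area ratios, one per LPC reflection coefficient
    gain: float


def reflection_to_lar(k):
    """Log area ratio of a reflection coefficient with |k| < 1."""
    return math.log((1.0 + k) / (1.0 - k))


def lar_to_reflection(g):
    """Inverse mapping; the result always lies strictly inside (-1, 1)."""
    return math.tanh(g / 2.0)


def make_frame(reflection_coeffs, gain):
    """Build a template frame from reflection coefficients and a gain."""
    return TemplateFrame([reflection_to_lar(k) for k in reflection_coeffs], gain)
```

Because the inverse mapping is a hyperbolic tangent, any averaged or interpolated LAR vector converts back to reflection coefficients of magnitude less than one, so the synthesis filter stays stable.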

The  first  step  in  the  synthesizer  is
to  translate  the  input  phoneme  sequence 
into  a  diphone  sequence.  In  some  cases, 
there  are  multiple  diphone  templates  for  a 
single  phoneme  pair.  The  template  used  is 
determined  by  the  surrounding  phonetic 
context.  Each  of  the  diphone  templates  is 
warped  in  time  so  that  the  input  phoneme 
duration  requirements  are  satisfied.  The 
time-warping  takes  into  account  the 
relative  inelasticity  of  the  phoneme 
transition  region.  Next,  the  program 
smooths  between  consecutive  diphone 
templates  to  minimize  gain  and  spectral 
discontinuities.  The  smoothing  algorithm 
is  designed  to  preserve  the  original 
parameter  tracks  intact  where  possible. 
Continuous  pitch  tracks  are  reconstructed 
by  linear  interpolation  of  the  sequence  of 
single  pitch  values,  one  for  each  phoneme. 
The  cutoff  frequency  and  voicing  flag  for 
each  frame  are  determined  by  rule  from  the 
phonemes  being  synthesized.  The  resulting 
continuous  parameter  tracks  (specified 
every  10  ms)  are  used  to  control  an  LPC 
speech  synthesizer. 
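
The pitch reconstruction step above can be sketched as follows. The sketch assumes that each transmitted pitch value is attached to the midpoint of its phoneme, that every phoneme is voiced and has a positive duration, and that the track is held constant before the first midpoint and after the last one. None of these details is stated in the text.

```python
def interpolate_pitch(durations_ms, pitch_values, frame_ms=10):
    """Return one pitch value per frame by linear interpolation between phonemes."""
    # Assumption: each pitch value sits at the midpoint of its phoneme.
    mids, start = [], 0.0
    for d in durations_ms:
        mids.append(start + d / 2.0)
        start += d
    track = []
    seg = 0
    for f in range(int(start // frame_ms)):
        t = f * frame_ms
        if t <= mids[0]:
            track.append(float(pitch_values[0]))
        elif t >= mids[-1]:
            track.append(float(pitch_values[-1]))
        else:
            # Advance to the pair of midpoints that brackets t.
            while t > mids[seg + 1]:
                seg += 1
            frac = (t - mids[seg]) / (mids[seg + 1] - mids[seg])
            step = pitch_values[seg + 1] - pitch_values[seg]
            track.append(pitch_values[seg] + frac * step)
    return track
```

For two 100 ms phonemes with pitch values 100 and 120, the frame at 100 ms falls halfway between the two midpoints and receives the value 110.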

3.2  Diphone  Analysis 

The  analyzer  takes  into  account  the 
synthesis  model  by  using  a  network  of 
diphone  templates  to  recognize  the 
sequence  of  phonemes  in  the  input  speech. 
The  diphone  network  consists  of  nodes  and 
directed  arcs.  An  example  of  a  simple 
network  is  shown  in  Fig.  3.  There  are  two
types  of  nodes:  phoneme  nodes  and 
spectrum  nodes.  The  phoneme  nodes  (shown 
as  labelled  circles)  correspond  to  the 
midpoints  of  the  phonemes;  there  is  one 
such  node  for  each  phoneme.  These  phoneme 
nodes  are  connected  by  diphone  templates.
Each  diphone  template  is  represented  in
the  network  as  a  sequence  of  spectrum
nodes  (shown  as  dots).

Fig.  3  Example  of  Diphone  Template  Network
for  Four  Phonemes.

When  two  or  more  consecutive  spectra  in
the  original
diphone  template  are  very  similar,  they 
are  represented  by  a  single  spectrum  node 
in  the  network.  The  open  dots  indicate 
the  first  spectrum  node  in  the  original 
diphone  template  that  is  at  or  past  the 
labelled  phoneme  boundary.  Note  that,  in 
Fig.  3,  the  diphone  template  P1-P2  is 
distinct  from  the  template  P2-P1.  Also 
note  the  possibility  of  diphones  of  the 
type  P1-P1.  The  network  allows  for  two  or
more  templates  going  from  one  phoneme  to
another  (e.g.,  P2-P1).  Branching  and
merging  of  paths  within  a  template  is  also
allowed  (e.g.,  P1-P3).  Finally,  the
network  allows  the  specification  of 
diphones  in  context.  The  node  P4/&P3 
represents  the  phoneme  P4  followed  only  by 
P3.  Thus  the  template  P2-P4/&P3  can  be
different  from  the  unconditioned  template 
P2-P4.  The  generation  and  training  of  the 
network  is  discussed  further  in  Section 
3.3.  The  analyzer  chooses  the  sequence  of 
templates  that  best  matches  the  input 
speech  according  to  a  distance  measure. 
Since  the  received  speech  is  spectrally 
close  to  the  original  it  is  hoped  that 
this  procedure  will  suffer  minimally  from 
phoneme  recognition  errors. 
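
The step in which runs of very similar consecutive spectra are represented by a single spectrum node might be sketched as below. The similarity threshold, the use of a plain Euclidean distance, and the running-mean update are assumptions made for illustration.

```python
import math


def collapse_similar_frames(frames, threshold):
    """Merge runs of consecutive, similar frames into single spectrum nodes.

    Returns a list of [mean_vector, frame_count] pairs, one per node.
    """
    nodes = []
    for frame in frames:
        if nodes and math.dist(nodes[-1][0], frame) <= threshold:
            mean, count = nodes[-1]
            # Fold the new frame into the running mean of the current node.
            merged = [(m * count + x) / (count + 1) for m, x in zip(mean, frame)]
            nodes[-1] = [merged, count + 1]
        else:
            nodes.append([list(frame), 1])
    return nodes
```

A steady-state vowel of many near-identical frames then collapses to a few nodes, while a rapid transition keeps roughly one node per frame.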

The  network  matcher  uses  a  dynamic 
programming  algorithm  which  attempts  to 
find  the  sequence  of  templates  in  the 
network  that  best  matches  the  input.  The 
basic  operation  of  the  program  begins  by 
updating  each  "theory"  by  the  addition  of 
the  newest  input  frame.  A  theory  consists 
of  a  detailed  account  of  how  a  sequence  of 
input  frames  is  aligned  with  the  network, 
along  with  a  total  score  for  that
correspondence.  Since  the  program  allows 
more  than  one  input  frame  to  be  aligned 
with  a  single  spectrum  node,  each  theory 
is  replaced  by  at  least  two  new 
theories:  one  that  corresponds  to  the  new 
input  frame  being  aligned  with  the  same 
node  as  the  previous  frame,  and  one  for 
each  of  the  nodes  immediately  following 
that  most  recent  node.  The  program  then 
discards  all  but  the  best  n  theories, 
where  n  is  fixed. 
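
One frame of the theory update described above could be sketched as follows. The network representation (a dictionary from node id to a spectrum vector and a list of successors) and the Euclidean frame distance are assumptions. The sketch also keeps only the best theory ending at each node before pruning, which is the usual dynamic programming simplification and may differ from the actual program.

```python
import math


def update_theories(theories, frame, network, beam_width):
    """Extend every theory by one input frame and keep the best `beam_width`.

    theories: list of (score, path); path is a tuple of node ids, one per frame.
    network:  dict node_id -> (spectrum_vector, list_of_successor_ids).
    """
    best = {}  # last node id -> best (score, path) ending at that node
    for score, path in theories:
        last = path[-1]
        # The new frame either stays on the same node or moves to a successor.
        for node in [last] + network[last][1]:
            new_score = score + math.dist(frame, network[node][0])
            if node not in best or new_score < best[node][0]:
                best[node] = (new_score, path + (node,))
    return sorted(best.values())[:beam_width]


# Tiny hypothetical network: three spectrum nodes in a chain.
network = {"a": ([0.0], ["b"]), "b": ([1.0], ["c"]), "c": ([2.0], [])}
frames = [[0.0], [0.0], [1.0], [2.0], [2.0]]
theories = [(math.dist(frames[0], network["a"][0]), ("a",))]
for frame in frames[1:]:
    theories = update_theories(theories, frame, network, beam_width=3)
print(theories[0])
```

In this example the best surviving theory aligns the five input frames with nodes a, a, b, c, c at a total score of zero, since each frame matches its node exactly.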

To  decide  on  the  phonemes  to 
transmit,  the  program  examines  all  the 
remaining  theories  after  updating  each 
theory  with  the  newest  input  frame.  If 
all  the  theories  have  a  common  beginning, 
the  phonemes  corresponding  to  this  common 
beginning  are  transmitted.  At  this  early 
stage  in  the  research,  we  find  that 
preserving  a  few  hundred  theories  for  each 
frame  essentially  guarantees  that  the 
program  will  find  the  globally  optimum 
path  through  the  network.  As  the  template 
statistics  are  updated  through  training, 
we  hope  that  the  number  of  theories 
required  for  this  "bounded  breadth  search" 
will  decrease  somewhat.  Thus,  the 
analyzer  requires  a  variable  lag  between 
the  input  and  the  transmitted  output.  The 
program  can  also  be  given  a  maximum  delay 
parameter  which  forces  a  decision  if  the 
transmission  delay  exceeds  a  threshold. 
These  two  features  allow  the  analyzer  to 
operate  in  a  continuous  mode,  rather  than 
on  one  sentence  at  a  time. 
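
The decision rule above, which transmits whatever beginning all surviving theories share and forces a decision when the lag grows too long, might be sketched like this. Measuring the delay in pending phonemes rather than in time, and trusting the best theory when a decision is forced, are assumptions made for the sketch.

```python
def common_prefix_length(sequences):
    """Number of leading items shared by every sequence."""
    shortest = min(len(s) for s in sequences)
    for i in range(shortest):
        if any(s[i] != sequences[0][i] for s in sequences):
            return i
    return shortest


def phonemes_to_transmit(theories, sent, max_pending=None):
    """Return the phonemes that can be sent now.

    theories:    phoneme sequences of the surviving theories, best first.
    sent:        number of phonemes already transmitted.
    max_pending: optional bound on undecided phonemes (hypothetical delay unit).
    """
    agreed = common_prefix_length(theories)
    best = theories[0]
    if max_pending is not None and len(best) - agreed > max_pending:
        # Forced decision: commit to the best theory for the overdue phonemes.
        agreed = len(best) - max_pending
    return best[sent:agreed]
```

With theories S-IY-T and S-IY-D and nothing sent yet, the call returns S and IY and leaves the final phoneme undecided until later frames resolve it.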

The  spectral  distance  measure 
currently  used  by  the  analyzer  in  this 
vocoder  is  a  simple  weighted  Euclidean
distance  between  the  Log-Area-Ratio  (LAR) 
vectors  of  the  template  and  the  input. 
Other  distance  measures  will  be  used  as 
needed. 
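
A weighted Euclidean distance between two LAR vectors can be written in a few lines. The sketch assumes one non-negative weight per LAR coefficient, applied to the squared differences; the actual weights are not given in the text.

```python
import math


def weighted_lar_distance(template_lar, input_lar, weights):
    """Weighted Euclidean distance between two log-area-ratio vectors."""
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, template_lar, input_lar))
    )
```

Setting all weights to one gives the ordinary Euclidean distance, and a larger weight makes mismatches in that coefficient count for more.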

3.3  Network  Generation  and  Training 

Initially,  the  network  is  generated 
by  duplicating  all  the  templates  used  by 
the  diphone  synthesizer.  Then,  the 
network  is  augmented  by  the  addition  of 
new  paths  as  a  result  of  extensive 
training.  This  allows  the  network  to 
represent  a  variety  of  pronunciations  for 
each  diphone. 

The  analyzer  allows  the  researcher  to 
train  the  network  on  new  speech 
interactively.  The  researcher  guides  the 
matcher  through  the  correct  phoneme 
sequence.  Once  the  program  has  aligned 
the  input  speech  with  the  network,  the 
researcher  then  instructs  the  program  to 
use  the  aligned  speech  to  update  the 
templates  involved.  If  the  input  is 
substantially  different  from  a  region  of 
the  network,  the  program  can  add  new  paths 
corresponding  to  partial  or  complete
templates.  In  this  way,  the  network  will
eventually  contain  a  variety  of  different 
acoustic  realizations  for  each  diphone. 
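
The training step above could take roughly the following form for a single spectrum node. Everything specific here is an assumption: the node layout, the running-mean update, and the rule that a distance above a fixed threshold creates a new path instead of updating the existing node.

```python
import math


def train_node(node, aligned_frames, new_path_threshold):
    """Update one spectrum node from the input frames aligned with it.

    node: dict with 'mean' (list of floats) and 'count' (frames seen so far).
    Returns ('updated', node) or ('new_path', new_node).
    """
    dims = len(node["mean"])
    n_new = len(aligned_frames)
    frame_mean = [sum(f[d] for f in aligned_frames) / n_new for d in range(dims)]
    if math.dist(frame_mean, node["mean"]) > new_path_threshold:
        # Too different from the stored spectrum: start an alternate path.
        return "new_path", {"mean": frame_mean, "count": n_new}
    total = node["count"] + n_new
    node["mean"] = [
        (node["mean"][d] * node["count"] + frame_mean[d] * n_new) / total
        for d in range(dims)
    ]
    node["count"] = total
    return "updated", node
```

In this sketch, speech close to an existing template refines its statistics, while speech far from it adds an alternate path in place of distorting the stored average.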

4.  SUMMARY

We  have  described  the  initial 
experiments  and  current  work  leading  to 
the  development  of  a  very-low-rate 
phonetic  vocoder  that  operates  at  around 
100  b/s.  The  diphone  synthesizer  for  a 
single  speaker  is  complete  and,  given  the 
correct  phoneme  sequence,  produces  highly 
intelligible  speech.  An  initial  version 
of  the  diphone  analyzer  has  been  designed 
and  implemented,  but  will  need  extensive 
training  before  conclusions  can  be  drawn. 

Finally,  once  the  analysis  component 
of  the  VLR  vocoder  is  operating  at  a  level 
of  performance  sufficient  for  intelligible 
communications,  it  would  also  likely  lead 
to  the  design  of  more  advanced  automatic 
speech  recognition  systems. 